Abstract

We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) It 
provably works for any classifier representation and classification problem including those with severe noise. 2) It 
is efficiently implementable with an ERM oracle. 3) It is more aggressive than all previous approaches satisfying 1 
and 2. To do this we create an algorithm based on a newly defined optimization problem and analyze it. We also 
conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths 
and weaknesses in different settings. 


1 Introduction 


How can you best learn a classifier given a label budget? 

Active learning approaches are known to yield exponential improvements over supervised learning under strong
assumptions (Cohn et al., 1994). Under much weaker assumptions, streaming-based agnostic active learning (Balcan
et al., 2006; Beygelzimer et al., 2009, 2010; Dasgupta et al., 2007; Zhang and Chaudhuri, 2014) is particularly
appealing since it is known to work for any classifier representation and any label noise distribution with an i.i.d. data
source.¹ Here, a learning algorithm decides for each unlabeled example in sequence whether or not to request a label,
never revisiting this decision. Restated then: What is the best possible active learning algorithm which works for any
classifier representation, any label noise distribution, and is computationally tractable?

Computational tractability is a critical concern, because most known algorithms for this setting (e.g., Balcan
et al., 2006; Koltchinskii, 2010; Zhang and Chaudhuri, 2014) require explicit enumeration of classifiers, implying
exponentially-worse computational complexity compared to typical supervised learning algorithms. Active learning
algorithms based on empirical risk minimization (ERM) oracles (Beygelzimer et al., 2009, 2010; Hsu, 2010) can
overcome this intractability by using passive classification algorithms as the oracle to achieve a computationally
acceptable solution.


Achieving generality, robustness, and acceptable computation has a cost. For the above methods (Beygelzimer
et al., 2009, 2010; Hsu, 2010), a label is requested on nearly every unlabeled example where two empirically good
classifiers disagree. This results in a poor label complexity, well short of information-theoretic limits (Castro and
Nowak, 2008) even for general robust solutions (Zhang and Chaudhuri, 2014). Until now.

In Section 3 we design a new algorithm, ACTIVE COVER (AC), for constructing query probability functions
that minimize the probability of querying inside the disagreement region (the set of points where good classifiers
disagree) and never query otherwise. This requires a new algorithm that maintains a parsimonious cover of the set
of empirically good classifiers. The cover is a result of solving an optimization problem (in Section 5) specifying the
properties of a desirable query probability function. The cover size provides a practical knob between computation
and label complexity, as demonstrated by the complexity analysis we present in Section 5.

In Section 4 we provide our main results, which demonstrate that AC effectively maintains a set of good classifiers,
achieves good generalization error, and has a label complexity bound tighter than previous approaches. The label
complexity bound depends on the disagreement coefficient (Hanneke, 2009), which does not completely capture the
advantage of the algorithm. In Section 4.2.2 we provide an example of a hard active learning problem where AC is


1 See the monograph of Hanneke (2014) for an overview of the existing literature, including alternative settings where additional assumptions
are placed on the data source (e.g., separability) as is common in other works (Dasgupta, 2005; Balcan et al., 2007; Balcan and Long, 2013).
substantially superior to previous tractable approaches. Together, these results show that AC is better and sometimes
substantially better in theory. The key aspects in the proof of our generalization results are presented in Section 7, with
more technical details and label complexity analysis presented in the appendix.

Do agnostic active learning algorithms work in practice? No previous works have addressed this question
empirically. Doing so is important because analysis cannot reveal the degree to which existing classification algorithms
effectively provide an ERM oracle. We conduct an extensive study in Section 6 by simulating the interaction of the
active learning algorithm with a streaming supervised dataset. Results on a wide array of datasets show that agnostic
active learning typically outperforms passive learning, and the magnitude of improvement depends on how carefully
the active learning hyper-parameters are chosen.


2 Preliminaries 


Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ be a set of binary classifiers, which we assume is finite for simplicity.² Let $E_X[\cdot]$ denote
expectation with respect to $X \sim P_X$, the marginal of $P$ over $\mathcal{X}$. The expected error of a classifier $h \in \mathcal{H}$ is
$\mathrm{err}(h) := \Pr_{(X,Y)\sim P}(h(X) \neq Y)$, and the error minimizer is denoted by $h^* := \arg\min_{h\in\mathcal{H}} \mathrm{err}(h)$. The
(importance weighted) empirical error of $h \in \mathcal{H}$ on a multiset $S$ of importance weighted and labeled examples drawn
from $\mathcal{X} \times \{\pm 1\} \times \mathbb{R}_+$ is $\mathrm{err}(h, S) := \sum_{(x,y,w)\in S} w \cdot 1(h(x) \neq y)/|S|$. The disagreement region for a subset of
classifiers $A \subseteq \mathcal{H}$ is $\mathrm{DIS}(A) := \{x \in \mathcal{X} \mid \exists h, h' \in A \text{ such that } h(x) \neq h'(x)\}$. The regret of a classifier
$h \in \mathcal{H}$ relative to another $h' \in \mathcal{H}$ is $\mathrm{reg}(h, h') := \mathrm{err}(h) - \mathrm{err}(h')$, and the analogous empirical regret on $S$ is
$\mathrm{reg}(h, h', S) := \mathrm{err}(h, S) - \mathrm{err}(h', S)$. When the second classifier $h'$ in (empirical) regret is omitted, it is taken to be
the (empirical) error minimizer in $\mathcal{H}$.

A streaming-based active learner receives i.i.d. labeled examples $(X_1, Y_1), (X_2, Y_2), \ldots$ from $P$ one at a time;
each label $Y_i$ is hidden unless the learner decides on the spot to query it. The goal is to produce a classifier $h \in \mathcal{H}$
with low error $\mathrm{err}(h)$, while querying as few labels as possible.

In the IWAL framework (Beygelzimer et al., 2009), a decision whether or not to query a label is made randomly:
the learner picks a probability $p \in [0, 1]$, and queries the label with that probability. Whenever $p > 0$, an unbiased
error estimate can be produced using inverse probability weighting (Horvitz and Thompson, 1952). Specifically, for
any classifier $h$, an unbiased estimator $E$ of $\mathrm{err}(h)$ based on $(X, Y) \sim P$ and $p$ is as follows: if $Y$ is queried, then
$E = 1(h(X) \neq Y)/p$; else, $E = 0$. It is easy to check that $\mathbb{E}[E] = \mathrm{err}(h)$. Thus, when the label is queried, we
produce the importance weighted labeled example $(X, Y, 1/p)$.³
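The unbiasedness of the inverse-probability-weighted estimate can be checked numerically. The following is a minimal sketch, not part of the algorithm: the function name `iwal_estimates`, the fixed query probability, and the simulated mistake rate are all illustrative assumptions.

```python
import numpy as np

def iwal_estimates(mistakes, p, rng):
    """Importance-weighted error estimates, one per example.

    mistakes: boolean array, True where h(X) != Y.
    p: query probability in (0, 1], held fixed here for simplicity (an assumption).
    """
    queried = rng.random(mistakes.shape[0]) < p    # query each label with probability p
    return np.where(queried, mistakes / p, 0.0)    # 1(h(X) != Y)/p if queried, else 0

rng = np.random.default_rng(0)
true_err, p = 0.3, 0.25
mistakes = rng.random(200_000) < true_err          # simulated events h(X) != Y
mean_estimate = float(iwal_estimates(mistakes, p, rng).mean())
```

Although only about a quarter of the labels are used, the average of the weighted estimates should land close to the simulated error rate, at the price of a larger variance (roughly a factor of 1/p).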


3 Algorithm 

Our new algorithm, shown in Algorithm 1, breaks the example stream into epochs. The algorithm admits any
epoch schedule so long as the epoch lengths satisfy $\tau_{m+1} \le 2\tau_m$. For technical reasons, we always query the first 3
labels to kick-start the algorithm. At the start of epoch $m$, AC computes a query probability function $P_m : \mathcal{X} \to [0, 1]$
which will be used for sampling the data points to query during the epoch. This is done by maintaining a few objects
of interest during each epoch:

1. In step (1), we compute the best classifier on the sample $Z_m$ that we have collected so far. Note that the sample
consists of queried, true labels on some examples and predicted labels on the others.

2. A radius $\Delta_m$ is computed in step (2) based on the desired level of concentration we want the various empirical
quantities to satisfy.

3. The set $A_{m+1}$ in step (3) consists of all the hypotheses which are good according to our sample $Z_m$, with the
notion of good being measured as empirical regret being at most $\gamma\Delta_m$.

2 The assumption that $\mathcal{H}$ is finite can be relaxed to VC-classes using standard arguments.

3 If the label is not queried, we produce an ignored example of weight zero; its only purpose is to maintain the correct count of querying
opportunities. This ensures that $1/|S|$ is the correct normalization in $\mathrm{err}(h, S)$.

4 See Footnote 3. Adding an example of importance weight zero simply increments $|S|$ without updating other state of the algorithm, hence the
label used does not matter.
Algorithm 1 Active Cover (AC)

input: Constants $c_1, c_2, c_3$, confidence $\delta$, error radius $\gamma$, parameters $\alpha, \beta, \xi$ for (OP), epoch schedule $0 = \tau_0 < 3 = \tau_1 < \tau_2 < \tau_3 < \ldots < \tau_M$ satisfying $\tau_{m+1} \le 2\tau_m$ for $m \ge 1$.
initialize: epoch $m = 0$, $Z_0 := \emptyset$, $\Delta_0 := c_1\sqrt{\epsilon_1} + c_2\epsilon_1 \log 3$, where

$$\epsilon_m := \frac{32(\log(|\mathcal{H}|/\delta) + \log \tau_m)}{\tau_m}.$$

1: for $i = 4, \ldots, n$, do

2: if $i = \tau_m + 1$ then

3: Set $Z_m = Z_{m-1} \cup S$, and $S = \emptyset$.

4: Let

$$h_{m+1} := \arg\min_{h \in \mathcal{H}} \mathrm{err}(h, Z_m), \qquad (1)$$

$$\Delta_m := c_1\sqrt{\epsilon_m\, \mathrm{err}(h_{m+1}, Z_m)} + c_2\epsilon_m \log \tau_m, \qquad (2)$$

$$A_{m+1} := \{h \in \mathcal{H} \mid \mathrm{err}(h, Z_m) - \mathrm{err}(h_{m+1}, Z_m) \le \gamma\Delta_m\}. \qquad (3)$$

5: Compute the solution $P_{m+1}(\cdot)$ to the optimization problem (OP).

6: $m := m + 1$.

7: end if

8: Receive unlabeled data point $X_i$.

9: if $X_i \in D_m := \mathrm{DIS}(A_m)$, then

10: Draw $Q_i \sim \mathrm{Bernoulli}(P_m(X_i))$.

11: Update the set of examples:⁴

$$S := \begin{cases} S \cup \{(X_i, Y_i, 1/P_m(X_i))\}, & Q_i = 1 \\ S \cup \{(X_i, 1, 0)\}, & \text{otherwise.} \end{cases}$$

12: else

13: $S := S \cup \{(X_i, h_m(X_i), 1)\}$.

14: end if

15: end for

16: $h_{M+1} := \arg\min_{h \in \mathcal{H}} \mathrm{err}(h, Z_M)$.
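The per-example update in steps 9 to 13 of Algorithm 1 can be sketched as follows. This is an illustrative sketch only: the function name `process_example` and its arguments are hypothetical, and membership in the disagreement region and the query probability are assumed to be supplied by the caller.

```python
import numpy as np

def process_example(x, y, in_disagreement, p, h_m_label, rng):
    """Return the (x, label, weight) triple added to S for one streamed example.

    in_disagreement: whether x lies in D_m (assumed to be computed elsewhere).
    p: the query probability P_m(x); h_m_label: the prediction h_m(x).
    """
    if in_disagreement:
        if rng.random() < p:               # Q_i ~ Bernoulli(P_m(x))
            return (x, y, 1.0 / p)         # queried: true label, importance weight 1/p
        return (x, 1, 0.0)                 # not queried: placeholder label, weight zero
    return (x, h_m_label, 1.0)             # outside D_m: predicted label, weight one

rng = np.random.default_rng(0)
triple = process_example("x1", -1, False, 0.5, 1, rng)
```

Note that the true label `y` is only read in the branch where the query is made; in a real stream it would be requested at that point rather than passed in up front.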


Within the epoch, $P_m$ determines the probability of querying an example in the disagreement region for this set $A_m$
of "good" classifiers; examples outside this region are not queried but given labels predicted by $h_m$. Consequently,
the sample is not unbiased, unlike some of the predecessors of our work. The various constants in Algorithm 1 must
satisfy:


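$$\alpha \ge 1, \quad \eta \ge 864, \quad \xi \le \frac{1}{8 n \epsilon_M \log n}, \quad \beta^2 \le \frac{1}{864\, \gamma\, n \epsilon_M \log n}, \quad \gamma \ge \eta/4,$$

$$c_1 \ge 2\alpha\sqrt{6}, \quad c_2 \ge \eta c_1^2/4, \quad c_3 \ge 1. \qquad (4)$$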


Epoch Schedules: The algorithm as stated takes an arbitrary epoch schedule subject to $\tau_m < \tau_{m+1} \le 2\tau_m$. Two
natural extremes are unit-length epochs, $\tau_m = m$, and doubling epochs, $\tau_{m+1} = 2\tau_m$. The main difference comes in
the number of times (OP) is solved, which is a substantial computational consideration. Unless otherwise stated, we
assume the doubling epoch schedule so that the query probability and ERM classifier are recomputed only $O(\log n)$
times.
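The two extreme schedules can be generated and compared with a few lines of code. This is a sketch under assumptions: the function names are hypothetical, the first boundary is fixed at 3 to match the three initial queries, and the last epoch is truncated at n.

```python
def doubling_schedule(n, first=3):
    """Epoch boundaries 0 = tau_0 < tau_1 = first < ... with tau_{m+1} = 2 * tau_m, capped at n."""
    taus = [0, first]
    while taus[-1] < n:
        taus.append(min(2 * taus[-1], n))
    return taus

def unit_schedule(n, first=3):
    """Epoch boundaries that advance by one example after the first epoch."""
    return [0] + list(range(first, n + 1))

def is_valid(taus):
    """Check strictly increasing boundaries with tau_{m+1} <= 2 * tau_m for m >= 1."""
    increasing = all(a < b for a, b in zip(taus, taus[1:]))
    bounded = all(b <= 2 * a for a, b in zip(taus[1:], taus[2:]))
    return increasing and bounded

n = 1000
num_solves_doubling = len(doubling_schedule(n)) - 1   # one solve of (OP) per epoch
num_solves_unit = len(unit_schedule(n)) - 1
```

For n = 1000 the doubling schedule has 10 epochs while the unit-length schedule has 998, which is the gap between logarithmically many and linearly many solves of (OP).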


Optimization problem (OP) to obtain $P_m$: AC computes $P_m$ as the solution to the optimization problem (OP).
In essence, the problem encodes the properties of a query probability function that are essential to ensure good
generalization, while maintaining a low label complexity. As we will discuss later, some of the previous works can be
seen as specific ways of constructing feasible solutions to this optimization problem. The objective function of (OP)
encourages small query probabilities in order to minimize the label complexity. It might appear odd that we do not
use the more obvious choice for objective, which would be $E_X[P(X)]$; however, our choice simultaneously encourages
low query probabilities and also provides a barrier for the constraint $P(X) \le 1$, an important algorithmic aspect as
we will discuss in Section 5.

The constraints (5) in (OP) bound the variance in our importance-weighted regret estimates for every $h \in \mathcal{H}$. This
is key to ensuring good generalization as we will later use Bernstein-style bounds which rely on our random variables
having a small variance. Let us examine these constraints in more detail. The LHS of the constraints measures the
variance in our empirical regret estimates for $h$, measured only on the examples in the disagreement region $D_m$. This
is because the importance weights in the form of $1/P_m(X)$ are only applied to these examples; outside this region
we use the predicted labels with an importance weight of 1. The RHS of the constraint consists of three terms. The
first term ensures the feasibility of the problem, as $P(X) = 1/(2\alpha^2)$ for $X \in D_m$ will always satisfy the constraints.
The second empirical regret term makes the constraints easy to satisfy for bad hypotheses. This is crucial to rule out
large label complexities in case there are bad hypotheses that disagree very often with $h_m$. A benefit of this is easily
seen when $-h_m \in \mathcal{H}$, which might have a terrible regret, but would force a near-constant query probability on the
disagreement region if $\beta = 0$. Finally, the third term will be on the same order as the second one for hypotheses
in $A_m$, and is only included to capture the allowed level of slack in our constraints which will be exploited for the
efficient implementation in Section 5.

Of course, variance alone is not adequate to ensure concentration, and we also require the random variables of
interest to be appropriately bounded. This is ensured through the constraints (6), which impose a minimum query
probability on the disagreement region. Outside the disagreement region, we use the predicted label with an importance
weight of 1, so that our estimates will always be bounded (albeit biased) in this region. Note that this optimization
problem is written with respect to the marginal distribution of the data points $P_X$, meaning that we might have an infinite
number of the latter constraints. In Section 5 we describe how to solve this optimization problem efficiently, and
using access to only unlabeled examples drawn from $P_X$.

Finally we verify that the choices for $P_m$ according to some of the previous methods are indeed feasible in (OP).
This is most easily seen for Oracular CAL (Hsu, 2010), which queries with probability 1 if $X \in D_m$ and 0 otherwise.
Since $\alpha \ge 1$ in the variance constraints (5), the choice $P(X) = 1$ for $X \in D_m$ is feasible for (OP), and
consequently Oracular CAL always queries more often than the optimal distribution $P_m$ at each epoch. A similar argument
can also be made for the IWAL method (Beygelzimer et al., 2010), which also queries in the disagreement region with
probability 1, and hence suffers from the same sub-optimality compared to our choice.
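The feasibility claims can be checked on a finite sample by comparing the two sides of the variance constraint for a single classifier. The sketch below is illustrative only: the function name `constraint_slack`, the sample-average approximation of the expectations, and the argument names are assumptions, not part of the algorithm.

```python
import numpy as np

def constraint_slack(P, disagree, emp_regret, alpha, beta, gamma, xi, tau_prev, delta_prev):
    """Right-hand side minus left-hand side of the variance constraint for one classifier h.

    P: query probabilities on sampled points (positive wherever disagree is True).
    disagree: boolean array, True where h(x) != h_m(x) and x lies in D_m.
    emp_regret: empirical regret of h relative to h_m on the previous sample.
    """
    lhs = np.sum(1.0 / P[disagree]) / P.shape[0]          # sample average of I(x) / P(x)
    rhs = (2 * alpha**2 * np.mean(disagree)               # first term: ensures feasibility
           + 2 * beta**2 * gamma * emp_regret * tau_prev * delta_prev
           + xi * tau_prev * delta_prev**2)               # third term: allowed slack
    return rhs - lhs

rng = np.random.default_rng(0)
disagree = rng.random(1000) < 0.2
slack_query_always = constraint_slack(np.ones(1000), disagree, 0.0, 1.0, 0.0, 1.0, 0.0, 10, 0.1)
```

With the regret and slack terms switched off, querying with probability 1 leaves a positive slack, while the constant choice 1/(2α²) meets the constraint with equality, which is why the latter is the smallest constant that is always feasible.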


4 Generalization and Label Complexity 

We now present guarantees on the generalization error and label complexity of Algorithm 1 assuming a solver for
(OP), which we provide in the next section.

4.1 Generalization guarantees 

Our first theorem provides a bound on generalization error. Define 

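$$\overline{\mathrm{err}}_m(h) := \frac{1}{\tau_m} \sum_{j=1}^{m} (\tau_j - \tau_{j-1})\, E_{(X,Y)\sim P}\left[1(h(X) \neq Y \wedge X \in D_j)\right],$$

$\Delta^*_0 := \Delta_0$ and $\Delta^*_m := c_1\sqrt{\epsilon_m\, \overline{\mathrm{err}}_m(h^*)} + c_2\epsilon_m \log \tau_m$ for $m \ge 1$.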
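Optimization Problem (OP) to compute $P_m$

$$\min_{P} \quad E_X\left[\frac{1}{1 - P(X)}\right]$$

$$\text{s.t.} \quad \forall h \in \mathcal{H}: \quad E_X\left[\frac{1(h(x) \neq h_m(x) \wedge x \in D_m)}{P(X)}\right] \le b_m(h), \qquad (5)$$

$$\forall x \in \mathcal{X}: \ 0 \le P(x) \le 1, \quad \text{and} \quad \forall x \in D_m: \ P(x) \ge P_{\min,m}, \qquad (6)$$

where $I_h^m(X) = 1(h(x) \neq h_m(x) \wedge x \in D_m)$,

$$b_m(h) = 2\alpha^2 E_X[I_h^m(X)] + 2\beta^2\gamma\, \mathrm{reg}(h, h_m, Z_{m-1})\, \tau_{m-1}\Delta_{m-1} + \xi\tau_{m-1}\Delta_{m-1}^2, \quad \text{and}$$

$$P_{\min,m} = \min\left(\frac{c_3}{\sqrt{\frac{\tau_{m-1}\, \mathrm{err}(h_m, Z_{m-1})}{n\epsilon_M}} + \log \tau_{m-1}},\ \frac{1}{2}\right). \qquad (7)$$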


Essentially $\Delta^*_m$ is a population counterpart of the quantity $\Delta_m$ used in Algorithm 1, and crucially relies on $\overline{\mathrm{err}}_m(h^*)$,
the true error of $h^*$ restricted to the disagreement region, instead of the empirical error of the ERM at epoch $m$. This
quantity captures the inherent noisiness of the problem, and modulates the transition between $O(1/\sqrt{n})$ and $O(1/n)$
type error bounds as we see next.

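Theorem 1. Pick any $0 < \delta < 1/e$ such that $|\mathcal{H}|/\delta > \sqrt{192}$. Then recalling that $h^* = \arg\min_{h\in\mathcal{H}} \mathrm{err}(h)$, we have
for all epochs $m = 1, 2, \ldots, M$, with probability at least $1 - \delta$

$$\mathrm{reg}(h, h^*) \le 16\gamma\Delta^*_m \ \text{ for all } h \in A_{m+1}, \text{ and} \qquad (8)$$

$$\mathrm{reg}(h^*, h_{m+1}, Z_m) \le \eta\Delta_m/4. \qquad (9)$$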


The theorem is proved in Section 7.2.2 using the overall analysis framework described in Section 7.

Since we use $\gamma \ge \eta/4$, the bound (9) implies that $h^* \in A_m$ for all epochs $m$. This also guarantees that all the
predicted labels used by our algorithm are identical to those of $h^*$, since no disagreement amongst classifiers in $A_m$
was observed on those examples. This observation will be critical to our proofs, where we will exploit the fact that
using labels predicted by $h^*$ instead of observed labels on certain examples only introduces a bias in favor of $h^*$,
thereby ensuring that we never mistakenly drop the optimal classifier from our version space $A_m$.

The bound (8) shows that every hypothesis in $A_{m+1}$ has a small regret to $h^*$. Since the ERM classifier $h_{m+1}$
is always in $A_{m+1}$, this yields our main generalization error bound on the classifier $h_{M+1}$ output by Algorithm 1.
Additionally, it also clarifies the definition of the sets $A_m$ as the set of good classifiers: these are classifiers which indeed have
small population regret relative to $h^*$. In the worst case, if $\overline{\mathrm{err}}_m(h^*)$ is a constant, then the overall regret bound
is $O(1/\sqrt{n})$. The actual rates implied by the theorem, however, depend on the properties of the distribution, and below
we illustrate this with two corollaries. We start with a simple specialization to the realizable setting.


Corollary 1 (Realizable case). Under the conditions of Theorem 1, suppose further that $\mathrm{err}(h^*) = 0$. Then $\Delta_m =
\Delta^*_m = c_2\epsilon_m \log \tau_m$ and hence $\mathrm{reg}(h, h^*) \le 16 c_2 \gamma\, \epsilon_m \log \tau_m$ for all hypotheses $h \in A_{m+1}$.
In words, the corollary demonstrates a $\tilde{O}(1/n)$ rate after seeing $n$ unlabeled examples in the realizable setting. Of
course the use of $\overline{\mathrm{err}}_m(h^*)$ in defining $\Delta^*_m$ allows us to retain the fast rates even when $h^*$ makes some errors but they
do not fall in the disagreement region of good classifiers. One intuitive condition that controls the errors within the
disagreement region is the low-noise condition of Tsybakov (2004), which asserts that there exist constants $\zeta > 0$ and
$0 < \omega \le 1$ such that

$$\Pr(h(X) \neq h^*(X)) \le \zeta \cdot (\mathrm{err}(h) - \mathrm{err}(h^*))^{\omega} \quad \forall h \in \mathcal{H} \text{ such that } \mathrm{err}(h) - \mathrm{err}(h^*) \le \epsilon_0. \qquad (10)$$

Under this assumption, the extreme $\omega = 0$ corresponds to the worst-case setting while $\omega = 1$ corresponds to $h^*$ having
a zero error on the disagreement set of the classifiers with regret at most $\epsilon_0$. Under this assumption, we get the following
corollary of Theorem 1.

Corollary 2 (Tsybakov noise). Under conditions of Theorem 1, suppose further that Tsybakov's low-noise condition (10)
is satisfied with some parameters $\zeta$, $\omega$, and $\epsilon_0 = 1$. Then after $m$ epochs, we have
$\mathrm{reg}(h, h^*) = \tilde{O}\left(\tau_m^{-\frac{1}{2-\omega}} \log(|\mathcal{H}|/\delta)\right)$.


The proof of this result is deferred to Appendix E. It is worth noting that the rates obtained here are known to be
unimprovable for even passive learning under the Tsybakov noise condition (Castro and Nowak, 2008).⁵ Consequently,
there is no loss of statistical efficiency in using our active learning approach. The result is easily extended for other
values of $\epsilon_0$ by using the worst-case bound until the first epoch $m_0$ when $16\gamma\Delta^*_{m_0}$ drops below $\epsilon_0$ and then applying our
analysis above from $m_0$ onwards. We leave this development to the reader.


4.2 Label complexity 

Generalization alone does not convey the entire quality of an active learning algorithm, since a trivial algorithm queries 
always with probability 1, thereby matching the generalization guarantees of passive learning. In this section, we show 
that our algorithm can achieve the aforementioned generalization guarantees, despite having a small label complexity 
in favorable situations. We begin with a worst-case result in the agnostic setting, and then describe a specific example 
which demonstrates some key differences of our approach from its predecessors. 


4.2.1 Disagreement-based label complexity bounds 


In order to quantify the extent of gains over passive learning, we measure the hardness of our problem using the
disagreement coefficient (Hanneke, 2014), which is defined as

$$\theta = \theta(h^*) := \sup_{r > 0} \frac{P_X\{x \mid \exists h \in \mathcal{H} \text{ s.t. } h^*(x) \neq h(x),\ P_X\{x' \mid h(x') \neq h^*(x')\} \le r\}}{r}. \qquad (11)$$


Intuitively, given a set of classifiers $\mathcal{H}$ and a data distribution $P$, an active learning problem is easy if good
classifiers disagree on only a small fraction of the examples, so that the active learning algorithm can increasingly restrict
attention only to this set. With this definition, we have the following result for the label complexity of Algorithm 1.
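For a small finite problem the disagreement coefficient can be computed by brute force. The sketch below is illustrative: the function name and the array layout are assumptions, and the supremum is evaluated only at the radii where the ball around h* gains a classifier, which is where the ratio can peak because the numerator is constant between such radii.

```python
import numpy as np

def disagreement_coefficient(preds, star, px):
    """Brute-force disagreement coefficient of h* for a finite class and finite domain.

    preds: (num_classifiers, num_points) array of +/-1 predictions.
    star: row index of h*; px: probability of each domain point.
    """
    diff = preds != preds[star]                 # diff[k, x] is True where h_k(x) != h*(x)
    dist = diff.astype(float) @ px              # P_X(h_k != h*) for every classifier
    theta = 0.0
    for r in np.unique(dist[dist > 0]):         # candidate radii
        in_ball = dist <= r + 1e-12             # classifiers within distance r of h*
        region = diff[in_ball].any(axis=0)      # points where some ball member disagrees with h*
        theta = max(theta, px[region].sum() / r)
    return theta

# Threshold classifiers h_t(x) = +1 if x >= t else -1 on a uniform grid of 100 points.
xs = np.arange(100)
thresholds = np.array([np.where(xs >= t, 1, -1) for t in range(101)])
theta = disagreement_coefficient(thresholds, 50, np.full(100, 0.01))
```

For thresholds with h* in the middle of the domain, the ball of radius r covers an interval of mass about 2r, so the computed value is 2 regardless of the grid size.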


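Theorem 2. Under conditions of Theorem 1, with probability at least $1 - \delta$, the number of label queries made by
Algorithm 1 after $n$ examples over $M$ epochs is at most

$$4\theta\, \overline{\mathrm{err}}_M(h^*)\, n + \theta \cdot \tilde{O}\left(\sqrt{n\, \overline{\mathrm{err}}_M(h^*) \log(|\mathcal{H}|/\delta)} + \log(|\mathcal{H}|/\delta)\right) + 4\log(8(\log n)/\delta).$$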


The proof is in Appendix D. The dominant first term of the label complexity bound is linear in the number of
unlabeled examples, but can be quite small if $\theta$ is small, or if $\overline{\mathrm{err}}_M(h^*) \approx 0$; it is indeed 0 in the realizable setting.
We illustrate this aspect of the theorem with a corollary for the realizable setting.


5 $\omega$ in our statement of the low-noise condition (10) corresponds to $1/\kappa$ in the results of Castro and Nowak (2008).


Corollary 3 (Realizable case). Under the conditions of Theorem 2, suppose further that $\mathrm{err}(h^*) = 0$. Then the
expected number of label queries made by Algorithm 1 is at most $\theta\, \tilde{O}(\log(|\mathcal{H}|/\delta))$.

In words, we attain a logarithmic label complexity in the realizable setting, so long as the disagreement coefficient
is bounded. We contrast this with the label complexity of IWAL (Beygelzimer et al., 2010), which grows as $\theta\sqrt{n}$
independent of $\mathrm{err}(h^*)$. This leads to an exponential difference in the label complexities of the two methods in
low-noise problems. A much closer comparison is with respect to the Oracular CAL algorithm (Hsu, 2010), which does
have a dependence on $\sqrt{n\, \mathrm{err}(h^*)}$ in the second term, but has a worse dependence on the disagreement coefficient $\theta$.

Just like Corollary 2, we can also obtain improved bounds on label complexity under the Tsybakov noise condition.

Corollary 4 (Tsybakov noise). Under conditions of Theorem 2, suppose further that the disagreement coefficient $\theta$ is
bounded and Tsybakov's low-noise condition (10) is satisfied with some parameters $\zeta$, $\omega$, and $\epsilon_0 = 1$. Then after $m$
epochs, the expected number of label queries made by Algorithm 1 is at most
$\tilde{O}\left(\tau_m^{\frac{2(1-\omega)}{2-\omega}} \log(|\mathcal{H}|/\delta)\right)$.


The proof of this result is deferred to Appendix E. The label complexity obtained above is indeed optimal in terms
of the dependence on $n$, the number of unlabeled examples, matching known information-theoretic rates of Castro and
Nowak (2008) when the disagreement coefficient $\theta$ is bounded. This can be seen since the regret from Corollary 2 falls
as a function of the number of queries at a rate of $\tilde{O}\left(q_m^{-\frac{1}{2(1-\omega)}} \log(|\mathcal{H}|/\delta)\right)$ after $m$ epochs, where $q_m$ is the number
of label queries. This is indeed optimal according to the lower bounds of Castro and Nowak (2008), after recalling
that $\omega = 1/\kappa$ in their results. Once again, the corollary highlights our improvements on top of IWAL, which does not
attain this optimal label complexity.

These results, while strong, still do not completely capture the performance of our method. Indeed the proofs of
these results are entirely based on the fact that we do not query outside the disagreement region, a property shared by
the previous Oracular CAL algorithm (Hsu, 2010). Indeed we only improve upon that result as we use more refined
error bounds to define the disagreement region. However, such analysis completely ignores the fact that we construct a
rather non-trivial query probability function on the disagreement region, as opposed to using any constant probability
of querying over this entire region. This gives our algorithm the ability to query much more rarely even over the
disagreement region, if the queries do not provide much information regarding the optimal hypothesis $h^*$. The next
section illustrates an example where this gain can be quantified.


4.2.2 Improved label complexity for a hard problem instance

We now present an example where the label complexity of Algorithm 1 is significantly smaller than both IWAL and
Oracular CAL by virtue of rarely querying in the disagreement region. The example considers a distribution and a
classifier space with the following structure: (i) for most examples a single good classifier predicts differently from
the remaining classifiers; (ii) on a few examples half the classifiers predict one way and half the other. In the first case,
little advantage is gained from a label because it provides evidence against only a single classifier. Active Cover
queries over the disagreement region with a probability close to $P_{\min,m}$ in case (i) and probability 1 in case (ii), while
others query with probability $\Omega(1)$ everywhere, implying $O(\sqrt{n})$ times more queries.

Concretely, we consider the following binary classification problem. Let $\mathcal{H}$ denote the finite classifier space
(defined later), and distinguish some $h^* \in \mathcal{H}$. Let $U\{-1, 1\}$ denote the uniform distribution on $\{-1, 1\}$. The data
distribution $D(X, Y)$ and the classifiers are defined jointly:

• With probability $\epsilon$,

$y = h^*(x)$, $\quad h(x) \sim U\{-1, 1\}$ $\ \forall h \neq h^*$.

• With probability $1 - \epsilon$,

$y \sim U\{-1, 1\}$, $\quad h^*(x) \sim U\{-1, 1\}$,

$h_r(x) = -h^*(x)$ for some $h_r$ drawn uniformly at random from $\mathcal{H} \setminus h^*$,
$h(x) = h^*(x)$ $\ \forall h \neq h_r$.
Indeed, $h^*$ is the best classifier because $\mathrm{err}(h^*) = \epsilon \cdot 0 + (1 - \epsilon)(1/2) = (1 - \epsilon)/2$, while $\mathrm{err}(h) = 1/2$ $\forall h \neq h^*$. This
problem is hard because only a small fraction of examples contain information about $h^*$. Ideally we want to focus label
queries on those informative examples while skipping the uninformative ones. However, algorithms like IWAL, or
more generally, active learning algorithms that determine label query probabilities based on error differences between
a pair of classifiers, query frequently on the uninformative examples. Let $u(h, h') := 1(h(x) \neq y) - 1(h'(x) \neq y)$
denote the error difference between two different classifiers $h$ and $h'$. Let $C$ be a random variable such that $C = 1$ for
the $\epsilon$ case and $C = 0$ for the $1 - \epsilon$ case. Then it is easy to see that

$$E[u(h, h') \mid C = 1] = \begin{cases} 0, & h \neq h^*,\ h' \neq h^*, \\ -1/2, & h = h^*,\ h' \neq h^*, \\ 1/2, & h \neq h^*,\ h' = h^*, \end{cases}$$

$$E[u(h, h') \mid C = 0] = 0, \quad \forall h, h' \in \mathcal{H}.$$

Therefore, IWAL queries all the time on uninformative examples ($C = 0$).
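The construction can be simulated directly to confirm the stated error rates. The sketch below is an illustration under assumptions: the function name is hypothetical, h* is stored in column 0, and h*(x) is drawn uniformly in the informative case as well, a detail the construction above leaves unspecified.

```python
import numpy as np

def sample_hard_instance(n, num_classifiers, eps, rng):
    """Return (preds, y, c) where preds[i, k] is h_k(x_i) and column 0 holds h*."""
    c = rng.random(n) < eps                                  # True in the informative eps case
    h_star = rng.choice([-1, 1], size=n)                     # assumption: uniform in both cases
    preds = np.repeat(h_star[:, None], num_classifiers, axis=1)
    # Informative case: every h != h* predicts uniformly at random, and y = h*(x).
    noise = rng.choice([-1, 1], size=(n, num_classifiers - 1))
    preds[c, 1:] = noise[c]
    # Uninformative case: one uniformly chosen h_r is flipped, the rest agree with h*.
    flip = rng.integers(1, num_classifiers, size=n)
    rows = np.flatnonzero(~c)
    preds[rows, flip[rows]] *= -1
    y = np.where(c, h_star, rng.choice([-1, 1], size=n))     # y is pure noise when c is False
    return preds, y, c

rng = np.random.default_rng(0)
preds, y, c = sample_hard_instance(100_000, 20, 0.1, rng)
errors = (preds != y[:, None]).mean(axis=0)                  # empirical error of each classifier
```

With ε = 0.1 the empirical error of h* should be close to (1 - ε)/2 = 0.45 and that of every other classifier close to 1/2, so the advantage of h* comes entirely from the rare informative examples.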

Now let us consider the label complexity of Algorithm 1 on this problem. Let us focus on the query probability
inside the $1 - \epsilon$ region, and fix it to some constant $p$. Let us also allow a query probability of 1 on the $\epsilon$ region. Then the
left hand side in the constraint (5) for any classifier $h$ is at most $\epsilon + \Pr(h(X) \neq h_m(X))/p \le \epsilon + 2/(p(|\mathcal{H}| - 1))$, since
$h$ and $h_m$ disagree only on those points in the $1 - \epsilon$ region where one of them is picked as the disagreeing classifier
$h_r$ in the random draw. On the other hand, the RHS of the constraints is at least $\xi\tau_{m-1}\Delta_{m-1}^2 \ge \xi\, \mathrm{err}(h_m, Z_{m-1})$,
which is at least $\xi/4$ as long as $\epsilon$ is small enough and $\tau_m$ is large enough for empirical error to be close to true error.
Consequently, assuming that $\epsilon \le \xi/8$, we find that any $p \ge 16/(\xi(|\mathcal{H}| - 1))$ satisfies the constraints. Of course we
also have that $p \ge P_{\min,m}$, which is $\tilde{O}(1/\sqrt{\tau_{m-1}})$ in this case since $\overline{\mathrm{err}}_m(h^*)$ is a constant. Consequently, for $|\mathcal{H}|$ large
enough, $p = P_{\min,m}$ is feasible and hence optimal for the population (OP). Since we find an approximately optimal
solution based on Theorem 4, the label complexity at epoch $m$ is $\tilde{O}(1/\sqrt{\tau_{m-1}})$. Summing things up, it can then be
checked easily that we make $\tilde{O}(\sqrt{n})$ queries over $n$ examples, a factor of $\sqrt{n}$ smaller than baselines such as IWAL
and Oracular CAL on this example.
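The two arithmetic steps in this argument can be written out explicitly. The first line is the feasibility check for the constant query probability. The second line is a sketch of the final summation under assumptions that are ours rather than the analysis': the doubling epoch schedule, a hypothetical constant $C$ bounding the per-example query rate in epoch $m$ by $C/\sqrt{\tau_{m-1}}$, only queries in the $1-\epsilon$ region counted, and logarithmic factors ignored.

```latex
\begin{align*}
\epsilon + \frac{2}{p\,(|\mathcal{H}|-1)}
  &\le \frac{\xi}{8} + \frac{2}{|\mathcal{H}|-1}\cdot\frac{\xi\,(|\mathcal{H}|-1)}{16}
   = \frac{\xi}{4}
  && \text{for } \epsilon \le \frac{\xi}{8},\ p \ge \frac{16}{\xi\,(|\mathcal{H}|-1)}, \\
\sum_{m=2}^{M} (\tau_m - \tau_{m-1})\,\frac{C}{\sqrt{\tau_{m-1}}}
  &= C \sum_{m=2}^{M} \sqrt{\tau_{m-1}}
   \le \frac{C\sqrt{\tau_M}}{\sqrt{2}-1} = O(\sqrt{n})
  && \text{for } \tau_m = 2\tau_{m-1},\ \tau_M \le n.
\end{align*}
```

The geometric sum in the second line is what turns a per-example rate of order $1/\sqrt{\tau_{m-1}}$ into a total of order $\sqrt{n}$ queries, compared with order $n$ for a method that queries with constant probability.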


5 Efficient implementation 


In Algorithm 1, the computation of $h_m$ is an ERM operation, which can be performed efficiently whenever an efficient
passive learner is available. However, several other hurdles remain. Testing for $x \in D_m$ in the algorithm, as well
as finding a solution to (OP), are considerably more challenging. The epoch schedule helps, but (OP) is still solved
$O(\log n)$ times, necessitating an extremely efficient solver.

Starting with the first issue, we follow Dasgupta et al. (2007), who cleverly observed that $x \in D_m$ can be efficiently
determined using a single call to an ERM oracle. Specifically, to apply their method, we use the oracle to find⁶
$h' = \arg\min\{\mathrm{err}(h, Z_{m-1}) \mid h \in \mathcal{H},\ h(x) \neq h_m(x)\}$. It can then be argued that $x \in D_m = \mathrm{DIS}(A_m)$ if and only if
the easily-measured regret of $h'$ (that is, $\mathrm{reg}(h', h_m, Z_{m-1})$) is at most $\gamma\Delta_{m-1}$.
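The membership test can be illustrated on a small finite class, with plain enumeration standing in for the constrained ERM oracle. This is a sketch under assumptions: the function names, the array layout, and the use of enumeration are illustrative and are not how an oracle-based implementation would search the class.

```python
import numpy as np

def weighted_error(preds, labels, weights):
    """Importance-weighted empirical error of each classifier: sum_i w_i 1(h(x_i) != y_i) / |S|."""
    return ((preds != labels) * weights).mean(axis=1)

def in_disagreement_region(preds_on_sample, labels, weights, preds_on_x, radius):
    """Test whether x lies in DIS(A_m) using one constrained minimization.

    preds_on_sample: (num_classifiers, |S|) array of +/-1 predictions on the sample.
    preds_on_x: (num_classifiers,) predictions on the query point x.
    radius: the threshold gamma * Delta_{m-1}.
    """
    errs = weighted_error(preds_on_sample, labels, weights)
    erm = int(np.argmin(errs))                      # h_m, the empirical risk minimizer
    rivals = preds_on_x != preds_on_x[erm]          # classifiers with h(x) != h_m(x)
    if not rivals.any():
        return False                                # nobody disagrees with h_m at x
    best_rival_err = errs[rivals].min()             # error of h', the best constrained classifier
    return bool(best_rival_err - errs[erm] <= radius)

sample_preds = np.array([[1, 1, -1, -1], [1, 1, -1, 1], [-1, -1, 1, 1]])
sample_labels = np.array([1, 1, -1, -1])
inside = in_disagreement_region(sample_preds, sample_labels, np.ones(4), np.array([1, -1, -1]), 0.3)
```

In the toy data the best classifier that disagrees with the ERM at x has empirical regret 0.25, so x is declared inside the region for a radius of 0.3 and outside for a radius of 0.1.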

Solving (OP) efficiently is a much bigger challenge because, as an optimization problem, it is enormous: There is
one variable $P(x)$ for every point $x \in \mathcal{X}$, one constraint (5) for each classifier $h$, and bound constraints (6) on $P(x)$
for every $x$. This leads to infinitely many variables and constraints, with an ERM oracle being the only computational
primitive available. Another difficulty is that (OP) is defined in terms of the true expectation with respect to the
example distribution $P_X$, which is unavailable.

In the following we first demonstrate how to efficiently solve (OP) assuming access to the true expectation $E_X[\cdot]$,
and then discuss a relaxation that uses expectation over samples. For the ease of exposition, we recall the shorthand
$I_h^m(x) = 1(h(x) \neq h_m(x) \wedge x \in D_m)$ from earlier.


6 We only have access to an unconstrained oracle. But that is adequate to solve with one constraint. See Appendix F of Karampatziakis and
Langford (2011) for details.
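Algorithm 2 Coordinate ascent algorithm to solve (OP)

input Accuracy parameter $\varepsilon > 0$. initialize $\lambda \leftarrow 0$.

1: loop

2: Rescale: $\lambda \leftarrow s \cdot \lambda$ where $s = \arg\max_{s \in [0,1]} \mathcal{D}(s \cdot \lambda)$.

3: Find $\bar{h} = \arg\max_{h \in \mathcal{H}} E_X\left[\frac{I_h^m(X)}{P_\lambda(X)}\right] - b_m(h)$.

4: if $E_X\left[\frac{I_{\bar{h}}^m(X)}{P_\lambda(X)}\right] - b_m(\bar{h}) \le \varepsilon$ then

5: return $\lambda$

6: else

7: Update $\lambda_{\bar{h}}$ as $\lambda_{\bar{h}} \leftarrow \lambda_{\bar{h}} + 2\, \dfrac{E_X[I_{\bar{h}}^m(X)/P_\lambda(X)] - b_m(\bar{h})}{E_X[I_{\bar{h}}^m(X)/q_\lambda(X)^3]}$.

8: end if

9: end loop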


5.1 Solving (OP) with the true expectation 

The main challenge here is that the optimization variable $P(x)$ is of infinite dimension. We deal with this difficulty
using Lagrange duality, which leads to a dual representation of $P(x)$ in terms of a set of classifiers found through
successive calls to an ERM oracle. As will become clear shortly, each of these classifiers corresponds to the most
violated variance constraint (5) under some intermediate query probability function. Thus at a high level, our strategy
is to expand the set of classifiers for representing $P(x)$ until the amount of constraint violation gets reduced to an
acceptable level.

We start by eliminating the bound constraints using barrier functions. Notice that the objective $E_X[1/(1 - P(X))]$
is already a barrier at $P(x) = 1$. To enforce the lower bound (6), we modify the objective to

$$E_X\left[\frac{1}{1 - P(X)}\right] + \mu^2 E_X\left[\frac{1(X \in D_m)}{P(X)}\right], \qquad (12)$$

where $\mu$ is a parameter chosen momentarily to ensure $P(x) \ge P_{\min,m}$ for all $x \in D_m$. Thus, the modified goal is to
minimize (12) over non-negative $P$ subject only to (5).

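We solve the problem in the dual where we have a large but finite number of optimization variables, and efficiently
maximize the dual using coordinate ascent with access to an ERM oracle over $\mathcal{H}$. Let $\lambda_h \ge 0$ denote the Lagrange
multiplier for the constraint (5) for classifier $h$. Then for any $\lambda$, we can minimize the Lagrangian

$$\mathcal{L}(P, \lambda) := E_X\left[\frac{1}{1 - P(X)}\right] + \mu^2 E_X\left[\frac{1(X \in D_m)}{P(X)}\right] - \sum_{h \in \mathcal{H}} \lambda_h \left(b_m(h) - E_X\left[\frac{1(h(X) \neq h_m(X) \wedge X \in D_m)}{P(X)}\right]\right) \qquad (13)$$

over each primal variable $P(x) \in [0, 1]$, yielding the solution

$$P_\lambda(x) = \frac{1(x \in D_m)\, q_\lambda(x)}{1 + q_\lambda(x)}, \quad \text{where } q_\lambda(x) = \sqrt{\mu^2 + \sum_{h \in \mathcal{H}} \lambda_h I_h^m(x)}. \qquad (14)$$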

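To see this, pick any $P$ satisfying $P(x) \in [0, 1]$ for all $x \in \mathcal{X}$ and consider the difference in the Lagrangians evaluated
at $P$ and $P_\lambda$:

$$\mathcal{L}(P, \lambda) - \mathcal{L}(P_\lambda, \lambda) = E_X\left[1(X \notin D_m)\left(\frac{1}{1 - P(X)} - 1\right)\right] + E_X\left[1(X \in D_m)\left(\frac{1}{1 - P(X)} + \frac{\mu^2 + \sum_{h \in \mathcal{H}} \lambda_h I_h^m(X)}{P(X)} - (1 + q_\lambda(X))^2\right)\right].$$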
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The first term is non-negative because P(x) G [0,1]. For the second term, notice that 

„ ^ n J 1 , V 2 + E h& nW(x) 

P x (x) = arg o mm i l(z £ D m ) + --- 


and that the minimum function value is exactly \(x £ l ) m )(1 + q x (x)) 2 . Hence the second term is also non-negative. 
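The closed form (14) can be checked numerically at a single point. The sketch below is illustrative only: the function names and the toy values of $\mu$, $\lambda$ and the indicators are assumptions, not part of the paper. It evaluates $P_\lambda(x) = q_\lambda(x)/(1+q_\lambda(x))$ and compares the per-point objective there against a grid search over $v \in (0,1)$.

```python
import math

def q_lambda(mu, lambdas, indicators):
    """q_lambda(x) = sqrt(mu^2 + sum_h lambda_h * I_h(x)) at a single point x."""
    return math.sqrt(mu ** 2 + sum(l * i for l, i in zip(lambdas, indicators)))

def p_lambda(mu, lambdas, indicators, in_region=True):
    """Closed-form minimizer from (14); zero outside the disagreement region."""
    if not in_region:
        return 0.0
    q = q_lambda(mu, lambdas, indicators)
    return q / (1.0 + q)

def pointwise_objective(v, mu, lambdas, indicators):
    """Per-point Lagrangian term 1/(1-v) + (mu^2 + sum_h lambda_h I_h(x)) / v."""
    c = mu ** 2 + sum(l * i for l, i in zip(lambdas, indicators))
    return 1.0 / (1.0 - v) + c / v

# Toy values (hypothetical): two classifiers, only the first disagrees with h_m at x.
mu, lambdas, indicators = 0.2, [0.5, 1.0], [1, 0]
q = q_lambda(mu, lambdas, indicators)
p_star = p_lambda(mu, lambdas, indicators)
value_at_p_star = pointwise_objective(p_star, mu, lambdas, indicators)
grid_min = min(pointwise_objective(k / 1000.0, mu, lambdas, indicators)
               for k in range(1, 1000))
```

On this toy input the value at `p_star` agrees with $(1+q_\lambda(x))^2$ up to rounding, and no grid point does better, which is the claim used in the argument above.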

Clearly, $\mu/(1+\mu) \le P_\lambda(x) \le 1$ for all $x \in D_m$, so all the bound constraints (6) in (OP) are satisfied if we choose $\mu = 2P_{\min,m}$. Plugging the solution $P_\lambda$ into the Lagrangian, we obtain the dual problem of maximizing the dual objective

$$\mathcal{D}(\lambda) = \mathbb{E}_X\left[\mathbf{1}(X \in D_m)(1 + q_\lambda(X))^2\right] - \sum_{h \in \mathcal{H}} \lambda_h b_m(h) + C_0 \tag{15}$$

over $\lambda \ge 0$. The constant $C_0$ is equal to $1 - \Pr(D_m)$, where $\Pr(D_m) = \Pr(X \in D_m)$. An algorithm to approximately solve this problem is presented in Algorithm 2. The algorithm takes a parameter $\varepsilon > 0$ specifying the degree to which all of the constraints (5) are to be approximated. Since $\mathcal{D}$ is concave, the rescaling step can be solved using a straightforward numerical line search. The main implementation challenge is in finding the most violated constraint (Step 3). Fortunately, this step can be reduced to a single call to an ERM oracle. To see this, note that the constraint violation on classifier $h$ can be written as


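$$\mathbb{E}_X\left[\frac{I_h^m(X)}{P(X)}\right] - b_m(h) = \mathbb{E}_X\left[\mathbf{1}(X \in D_m)\left(\frac{1}{P(X)} - 2\alpha^2\right)\mathbf{1}(h(X) \ne h_m(X))\right] - 2\beta^2\gamma\,\tau_{m-1}\Delta_{m-1}\left(\mathrm{err}(h, \tilde{Z}_{m-1}) - \mathrm{err}(h_m, \tilde{Z}_{m-1})\right) - \xi\,\tau_{m-1}\Delta_{m-1}^2.$$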


The first term of the right-hand expression is the risk (classification error) of $h$ in predicting samples labeled according to $h_m$, with importance weights of $1/P(x) - 2\alpha^2$ if $x \in D_m$ and 0 otherwise; note that these weights may be positive or negative. The second term is simply the scaled risk of $h$ with respect to the actual labels. The last two terms do not depend on $h$. Thus, given access to $\mathcal{P}_X$ (or samples approximating it, discussed shortly), the most violated constraint can be found by solving an ERM problem defined on the labeled samples in $\tilde{Z}_{m-1}$ and samples drawn from $\mathcal{P}_X$ labeled by $h_m$, with appropriate importance weights detailed in Appendix F.1.
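The coordinate ascent loop just described (rescale, find the most violated constraint, stop or take the additive update) can be sketched on a small finite sample. Everything specific below is an assumption made for illustration: the expectation is replaced by an average over a toy sample, a brute-force maximum over a two-element hypothesis list stands in for the ERM oracle, the bounds `b` are supplied directly rather than computed from $b_m(h)$, and all names are hypothetical.

```python
import math

def q_values(lam, I, mu):
    """q_lambda(x) for every sample point; I[h][x] is the indicator I_h^m(x)."""
    n = len(I[0])
    return [math.sqrt(mu ** 2 + sum(lam[h] * I[h][x] for h in range(len(lam))))
            for x in range(n)]

def dual(lam, I, b, in_region, mu):
    """Sample version of the dual objective (15)."""
    n = len(in_region)
    q = q_values(lam, I, mu)
    inside = sum(in_region[x] * (1.0 + q[x]) ** 2 for x in range(n)) / n
    c0 = 1.0 - sum(in_region) / n
    return inside - sum(l * bh for l, bh in zip(lam, b)) + c0

def violation(h, lam, I, b, mu):
    """Mean of I_h / P_lambda minus b(h), using 1/P_lambda = (1 + q)/q where I_h = 1."""
    n = len(I[h])
    q = q_values(lam, I, mu)
    return sum(I[h][x] * (1.0 + q[x]) / q[x] for x in range(n)) / n - b[h]

def coordinate_ascent(I, b, in_region, mu, eps, max_iters=10000):
    lam = [0.0] * len(b)
    n = len(in_region)
    for _ in range(max_iters):
        # Rescale: ternary search for s in [0, 1] maximizing the concave map s -> D(s * lam).
        lo, hi = 0.0, 1.0
        for _ in range(60):
            m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
            d1 = dual([m1 * l for l in lam], I, b, in_region, mu)
            d2 = dual([m2 * l for l in lam], I, b, in_region, mu)
            if d1 < d2:
                lo = m1
            else:
                hi = m2
        s = 0.5 * (lo + hi)
        lam = [s * l for l in lam]
        # Most violated constraint; brute force stands in for the ERM oracle here.
        viols = [violation(h, lam, I, b, mu) for h in range(len(b))]
        h_bar = max(range(len(b)), key=lambda h: viols[h])
        if viols[h_bar] <= eps:
            return lam
        # Additive update: twice the violation over the mean of I / q^3.
        q = q_values(lam, I, mu)
        denom = sum(I[h_bar][x] / q[x] ** 3 for x in range(n)) / n
        lam[h_bar] += 2.0 * viols[h_bar] / denom
    raise RuntimeError("no convergence within max_iters")

# Toy instance (hypothetical): four points, all in the region, two classifiers.
I = [[1, 1, 0, 0], [0, 1, 1, 0]]
in_region = [1, 1, 1, 1]
b = [1.05, 1.05]
mu, eps = 0.2, 0.05
lam = coordinate_ascent(I, b, in_region, mu, eps)
```

The additive step is a Newton step on the chosen coordinate of the dual, since the derivative of (15) with respect to $\lambda_h$ is exactly the constraint violation for $h$ and its second derivative is $-\mathbb{E}_X[I_h^m(X)/(2q_\lambda(X)^3)]$.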

When all primal constraints are approximately satisfied, the algorithm stops. Consequently, we can execute each step of Algorithm 2 with one call to an appropriately defined ERM oracle, and approximate primal feasibility is guaranteed when the algorithm stops. More specifically, we can prove the following guarantee on the convergence of the algorithm.

Theorem 3. When run on the m-th epoch, Algorithm 2 has the following guarantees.

1. It halts in at most iterations. 


2. The solution $\lambda \ge 0$ it outputs has bounded $\ell_1$ norm: $\|\lambda\|_1 \le \Pr(D_m)/\varepsilon$.

3. The query probability function $P_\lambda$ satisfies:

• The variance constraints (5) up to an additive factor of $\varepsilon$, i.e.,

$$\forall h \in \mathcal{H}: \quad \mathbb{E}_X\left[\frac{\mathbf{1}(h(X) \ne h_m(X) \wedge X \in D_m)}{P_\lambda(X)}\right] \le b_m(h) + \varepsilon,$$

• The simple bound constraints (6) exactly,

• Approximate primal optimality:

$$\mathbb{E}_X\left[\frac{1}{1 - P_\lambda(X)}\right] \le f^* + 4P_{\min,m}\Pr(D_m), \tag{16}$$

where $f^*$ denotes the optimal value of (OP), i.e.,

$$f^* := \inf_P\ \mathbb{E}_X\left[\frac{1}{1 - P(X)}\right] \quad \text{s.t. } P \text{ satisfying (5) and (6)}. \tag{17}$$

That is, we find a solution with small constraint violation to ensure generalization, and a small objective value to be label efficient. If $\varepsilon$ is set to $\xi\tau_{m-1}\Delta_{m-1}^2$, an amount of constraint violation tolerable in our analysis, the number of iterations in Theorem 3 varies as $\mathrm{err}(h_m, \tilde{Z}_{m-1})$ varies between a constant and $O(1/\tau_{m-1})$. The theorem is proved in Appendix F.2.


5.2 Solving (OP) with expectation over samples

So far we considered solving (OP) defined on the unlabeled data distribution $\mathcal{P}_X$, which is not available in practice. A simple and natural substitute for $\mathcal{P}_X$ is an i.i.d. sample drawn from it. Here we show that solving a properly-defined sample variant of (OP) leads to a solution to the original (OP) with similar guarantees as in Theorem 3.

More specifically, we define the following sample variant of (OP). Let $S$ be a large sample drawn i.i.d. from $\mathcal{P}_X$, and (OP$_S$) be the same as (OP) except with all population expectations replaced by empirical expectations taken with respect to $S$. Now for any $\varepsilon \ge 0$, define (OP$_{S,\varepsilon}$) to be the same as (OP$_S$) except that the variance constraints (5) are relaxed by an additive slack of $\varepsilon$.

Every time Active Cover needs to solve (OP) (Step 5 of Algorithm 1), it draws a fresh unlabeled i.i.d. sample $S$ of size $u$ from $\mathcal{P}_X$, which can be done easily in a streaming setting by collecting the next $u$ examples. It then applies Algorithm 2 to solve (OP$_{S,\varepsilon}$) with accuracy parameter $\varepsilon$. Note that this is different from solving (OP$_S$) with accuracy parameter $2\varepsilon$. We establish the following convergence guarantees.

Theorem 4. Let $S$ be an i.i.d. sample of size $u$ from $\mathcal{P}_X$. When run on the m-th epoch for solving (OP$_{S,\varepsilon}$) with accuracy parameter $\varepsilon$, Algorithm 2 satisfies the following.

1. It halts in at most $\widehat{\Pr}(D_m)/(8P_{\min,m}^3\varepsilon^2)$ iterations, where $\widehat{\Pr}(D_m) := \sum_{X \in S}\mathbf{1}(X \in D_m)/u$.

2. The solution $\lambda \ge 0$ it outputs has bounded $\ell_1$ norm: $\|\lambda\|_1 \le \widehat{\Pr}(D_m)/\varepsilon$.

3. If $u \ge O\big((1/(P_{\min,m}\varepsilon)^4 + \alpha^4/\varepsilon^2)\log(|\mathcal{H}|/\delta)\big)$, then with probability $\ge 1 - \delta$, the query probability function $P_\lambda$ satisfies:

• All constraints of (OP) except with an additive slack of $2.5\varepsilon$ in the variance constraints (5).

• Approximate primal optimality:

$$\mathbb{E}_X\left[\frac{1}{1 - P_\lambda(X)}\right] \le f^* + 8P_{\min,m}\Pr(D_m) + (2 + 4P_{\min,m})\varepsilon,$$

where $f^*$ is the optimal value of (OP) defined in (17).


The proof is in Appendix F.3. Intuitively, the optimal solution $P^*$ to (OP) is also feasible in (OP$_{S,\varepsilon}$) since satisfying the population constraints leads to approximate satisfaction of sample constraints. Since our solution $P_\lambda$ is approximately optimal for (OP$_{S,\varepsilon}$) (this is essentially due to Theorem 3), this means that the sample objective at $P_\lambda$ is not much larger than that at $P^*$. We now use a concentration argument to show that this guarantee holds also for the population objective with slightly worse constants. The approximate constraint satisfaction in (OP) follows by a similar concentration argument. Our proofs use standard concentration inequalities along with Rademacher complexity to provide uniform guarantees for all vectors $\lambda$ with bounded $\ell_1$ norm.

The first two statements, finite convergence and boundedness of $\|\lambda\|_1$, are identical to Theorem 3 except $\Pr(D_m)$ is replaced by $\widehat{\Pr}(D_m)$. When $\varepsilon$ is set properly, i.e., to be $\xi\tau_{m-1}\Delta_{m-1}^2$, the number of unlabeled examples $u$ in the third statement grows polynomially in $\tau_{m-1}$, at a rate that depends on whether $\mathrm{err}(h_m, \tilde{Z}_{m-1})$ is a constant or $O(1/\tau_{m-1})$. The third statement shows that with enough unlabeled examples, we can get a query probability function almost as good as the solution to the population problem (OP).

Algorithm 3 Online Active Cover
input: cover size $l$, parameters $c_0$, $\alpha$ and $\beta_{scale}$.

1: Initialize online importance weighted minimization oracles $\{O_t\}_{t=0}^{l}$, each controlling a classifier and some associated weights $\{(h_t, \lambda_t, v_t, \omega_t)\}_{t=1}^{l}$ with all weights initialized to 0.

2: For the first three examples $\{X_i\}_{i=1}^{3}$, query the labels $\{Y_i\}_{i=1}^{3}$.

3: Let $h := O_0(\{(X_i, Y_i, 1)\}_{i=1}^{2})$.

4: Get error estimate $e_2$ from $O_0$ and compute $P_{\min,3}$.

5: Let $(X, Y^*, \hat{Y}, W) := (X_3, Y_3, h(X_3), 1)$. Set $\beta := (\sqrt{\alpha}/c_0)/\beta_{scale}$.

6: for $i = 4, \ldots, n$, do

7:   Update the ERM, the error estimate and the threshold:

     $h := O_0((X, Y^*, W))$,
     $e_{i-1} := \dfrac{(i-2)\,e_{i-2} + \mathbf{1}(\hat{Y} \ne Y^*)\,W}{i-1}$,
     $\Delta_{i-1} := \sqrt{c_0 e_{i-1}/(i-1)} + \max(2\alpha, 4)\,c_0\log(i-1)/(i-1)$.

8:   for $t = 1, \ldots, l$ do

9:     Compute $p_t := q_t/(1 + q_t)$, where $q_t := \sqrt{(2P_{\min,i-1})^2 + \sum_{t' < t}\lambda_{t'}\mathbf{1}(h_{t'}(X) \ne \hat{Y})}$.

10:    Set up the cost of predicting $y \in \{1, -1\}$, the target label and the importance weight:

       $c_y := 2\beta^2(i-2)\Delta_{i-2}\,\mathbf{1}(y \ne Y^*)\,W + \left(2\alpha^2 - \dfrac{1}{p_t}\right)\mathbf{1}(X \in D_{i-1} \wedge y \ne \hat{Y})$,   (18)
       $Y_t := \arg\min_y c_y$,
       $W_t := |c_1 - c_{-1}|$.

11:    Update the t-th classifier in the cover and its associated weights:

       $h_t := O_t((X, Y_t, W_t))$,
       $v_t := \max\left(v_t + 2(c_{\hat{Y}} - c_{h_t(X)}),\ 0\right)$,   (19)
       $\omega_t := \omega_t + \mathbf{1}(h_t(X) \ne \hat{Y} \wedge X \in D_{i-1})/q_t^3$,   (20)
       $\lambda_t := v_t/\omega_t$.   (21)

     end for

     Receive new data point $X_i$, and let $\hat{Y} := h(X_i)$.

     if $X_i \in D_i := \mathrm{DIS}(A_i)$, then

       Compute $P_i := q/(1 + q)$, where $q := \sqrt{(2P_{\min,i})^2 + \sum_{t=1}^{l}\lambda_t\mathbf{1}(h_t(X_i) \ne \hat{Y})}$.
       Draw $Q \sim \mathrm{Bernoulli}(P_i)$.

       if $Q = 1$ then

         Query $Y_i$ and set $(X, Y^*, W) := (X_i, Y_i, 1/P_i)$.

       else

         Set $(X, Y^*, W) := (X_i, 1, 0)$.

       end if
     else

       Set $(X, Y^*, W) := (X_i, h(X_i), 1)$.

     end if
   end for

6 Experiments with Agnostic Active Learning


While AC is efficient in the number of ERM oracle calls, it needs to store all past examples, resulting in large space complexity. As Theorem 3 suggests, the query probability function (14) may need as many as $O(\tau_i^2)$ classifiers, further increasing storage demand. In Section 6.1 we discuss a scalable online approximation to Active Cover, Online Active Cover (OAC), which we implemented and tested empirically with the setup in Section 6.2. Experimental results and discussions are in Section 6.3.


6.1 Online Active Cover (OAC) 


Algorithm 3 gives the online approximation that we implemented, which uses an epoch schedule of $\tau_i = i$, assigning every new example to a new epoch.

To explain the connections between Algorithms 1 (AC) and 3 (OAC), we start with the update of the ERM classifier and thresholds, corresponding to Step 1 of AC and Step 7 of OAC. Instead of batch ERM oracles, OAC invokes online importance weighted ERM oracles that are stateful and process examples in a streaming fashion without the need to store them. The specific importance weighted oracle we use is a reduction to online importance-weighted logistic regression [Karampatziakis and Langford, 2011] implemented in Vowpal Wabbit (VW). $Y^*$ denotes the actual label that is used to update the ERM classifier and, depending on the query decision (Steps 9 to 14), can be a queried label, a predicted label by the previous ERM classifier, or a dummy label of 1 associated with an importance weight of zero. The error variable $e_{i-1}$ keeps track of the progressive validation loss, which is a better estimate of the true classification error than the training error [Blum et al., 1999, Cesa-Bianchi et al., 2004].
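The error and threshold update of Step 7 in Algorithm 3 is a running average followed by a closed-form threshold. The sketch below is a minimal illustration under assumed names; the function and its argument names are hypothetical and it is not the Vowpal Wabbit implementation.

```python
import math

def update_error_and_threshold(i, e_prev, y_hat, y_star, w, c0, alpha):
    """One pass of Step 7 for example index i >= 4; e_prev plays the role of e_{i-2}.

    y_hat is the ERM prediction made before the label y_star was fixed, and w is the
    importance weight (1/P for a queried label, 0 for an unqueried one, 1 otherwise).
    """
    mistake = 1.0 if y_hat != y_star else 0.0
    # Progressive validation error: running average of weighted mistakes.
    e = ((i - 2) * e_prev + mistake * w) / (i - 1)
    # Threshold Delta_{i-1} as written in Step 7.
    delta = (math.sqrt(c0 * e / (i - 1))
             + max(2 * alpha, 4) * c0 * math.log(i - 1) / (i - 1))
    return e, delta

# Toy call (hypothetical values): a weighted mistake versus a correct prediction.
e_bad, delta_bad = update_error_and_threshold(4, 0.5, 1, -1, 1.0, 0.1, 1.0)
e_good, delta_good = update_error_and_threshold(4, 0.5, 1, 1, 1.0, 0.1, 1.0)
```

Because the prediction is scored before the classifier is updated on the same example, the running average is a progressive validation estimate rather than a training error.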

Instead of computing the query probability function by solving a batch optimization problem as in Step 5 of AC, OAC maintains a fixed number $l$ of classifiers that are intended to be a cover of the set of good classifiers. On every new example, this cover undergoes a sequence of online, importance weighted updates (Steps 8 to 12 of OAC), which are meant to approximate the coordinate ascent steps in Algorithm 2. The importance structure (18) is derived from accounting for the fact that the algorithm simply uses the incoming stream of examples to estimate $\mathbb{E}_X[\cdot]$ rather than a separate unlabeled sample. The same approximation is also present in the updates (19) and (20), which are online estimates of the numerator and the denominator of the additive coordinate update in Step 7 of Algorithm 2. Because (19) is an online estimate, we need to explicitly enforce non-negativity. Note that (19) has the following straightforward interpretation: if the prediction of $h_t$, the t-th classifier in the cover, is the same as that of the ERM, the weight associated with $h_t$ will not change. Otherwise, the weight of $h_t$ increases/decreases when its prediction has a smaller/larger cost than the prediction of the ERM.
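The weight updates (19) to (21) can be written as a small pure function. This is a sketch under assumptions: the function name and arguments are hypothetical, and returning $\lambda_t = 0$ while $\omega_t$ is still zero is a convention chosen here for illustration, since the text does not specify that case.

```python
def update_cover_weight(v, omega, c_yhat, c_pred, disagrees_in_region, q_t):
    """Online updates (19)-(21) for one classifier in the cover.

    c_yhat is the cost of the ERM prediction, c_pred the cost of this classifier's
    prediction, and disagrees_in_region says whether the classifier disagrees with
    the ERM on a point inside the disagreement region.
    """
    v = max(v + 2.0 * (c_yhat - c_pred), 0.0)      # (19), clipped at zero
    if disagrees_in_region:
        omega = omega + 1.0 / q_t ** 3             # (20)
    lam = v / omega if omega > 0.0 else 0.0        # (21); zero-omega case is an assumption
    return v, omega, lam

# Toy calls (hypothetical values).
first = update_cover_weight(0.0, 0.0, 0.0, -2.0, True, 0.5)
same_prediction = update_cover_weight(4.0, 8.0, 0.0, 0.0, False, 0.5)
clipped = update_cover_weight(1.0, 8.0, 0.0, 3.0, True, 0.5)
```

The second call illustrates the interpretation above: when the cover classifier predicts the same label as the ERM, both costs coincide and its weight is unchanged.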

To further clarify the effect of (18), we perform the following case analysis:

• If $X_{i-1} \notin D_{i-1}$, then for all $t \in \{1, \ldots, l\}$,

$$(c_{\hat{Y}}, c_{-\hat{Y}}) = (0,\ 2\beta^2(i-2)\Delta_{i-2}),$$

so $Y_t = \hat{Y}$. This means that all the classifiers in the cover are trained with the predicted label when the example is outside of the disagreement region.

• Otherwise, the costs for the t-th classifier in the cover are:

$$(c_{\hat{Y}}, c_{-\hat{Y}}) = \begin{cases} (0,\ 2\alpha^2 - 1/p_t), & Q = 0, \text{ i.e., the true label was not queried,} \\ (0,\ 2\alpha^2 - 1/p_t + 2\beta^2(i-2)\Delta_{i-2}/P_{i-1}), & Q = 1,\ \hat{Y} = Y_{i-1}, \\ (2\beta^2(i-2)\Delta_{i-2}/P_{i-1},\ 2\alpha^2 - 1/p_t), & Q = 1,\ \hat{Y} \ne Y_{i-1}. \end{cases}$$

In the first case, if $p_t > 1/(2\alpha^2)$, i.e., the query probability based on the previous $t - 1$ classifiers in the cover is large enough, then $c_{-\hat{Y}} > 0$ and the t-th classifier will be trained to agree with the predicted label. Otherwise, the t-th classifier will be trained to disagree with the predicted label, thereby increasing the query probability. In the second case, the true label $Y_{i-1}$ was queried and found to be the same as the predicted label, so unless $p_t$ is very small, the t-th classifier will not be trained to disagree with the ERM $h$. In the third case, the cost associated with the predicted label $\hat{Y}$ is always positive, so the true label $Y_{i-1}$ will be preferred unless $p_t$ or $\alpha$ is fairly large.

Finally, Steps 9 to 14 of AC and Steps 13 to 25 of OAC perform the querying of labels. As pointed out earlier, the test in Step 15 of OAC is done via an online technique detailed in Appendix F of Karampatziakis and Langford [2011].

6.2 Experiment Setting 

We conduct an empirical comparison of OAC with the following active learning algorithms. 


• IWAL0: Algorithm 1 of Beygelzimer et al. [2010], which performs importance-weighted sampling of labels and maintains an unbiased estimate of classification error. On every new example, it queries the true label with probability 1 if the error difference $G_k$ (Step 2 in Algorithm 1 of Beygelzimer et al. [2010]) is smaller than the threshold

$$\sqrt{\frac{C_0\log k}{k-1}} + \frac{C_0\log k}{k-1}, \tag{22}$$

where $C_0$ is a hyper-parameter. Otherwise, the query probability is a decreasing function of $G_k$.

• IWAL1: A slight modification of IWAL0 that uses a more aggressive, error-dependent threshold:

$$\sqrt{\frac{C_0\log k}{k-1}\,e_{k-1}} + \frac{C_0\log k}{k-1}, \tag{23}$$

where $e_{k-1}$ is the importance-weighted error estimate after the algorithm processes $k - 1$ examples.

• ORA-IWAL0: An Oracular-CAL [Hsu, 2010] style variant of IWAL0 that queries the label of a new example with probability 1 if the error difference $G_k$ (see IWAL0 above) is smaller than the threshold (22). Otherwise, it uses the predicted label by the current ERM classifier.

• ORA-IWAL1: An Oracular-CAL [Hsu, 2010] style variant of IWAL1 that resembles ORA-IWAL0 except that it uses the error-dependent threshold (23). Note that the error estimate $e_{k-1}$ now uses both the queried labels and predicted labels, and is no longer unbiased. We remark that a theoretical analysis of this algorithm has recently been given by Zhang [2015]. In fact, it is almost identical to an Oracular-CAL [Hsu, 2010] style variant of Algorithm 3 that uses a query probability $P_i$ of 1 whenever the disagreement test in Step 15 of Algorithm 3 returns true, except that its threshold (23) is slightly different from the one used by Algorithm 3 (Step 7).

• PASSIVE: Passive learning using all the labels of incoming examples up to some label budget. 
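The two query thresholds (22) and (23) used by the IWAL and ORA-IWAL variants differ only in the error factor under the square root. A minimal sketch, with hypothetical function names and toy values:

```python
import math

def iwal0_threshold(k, c0):
    """Threshold (22): sqrt(C0 log k / (k - 1)) + C0 log k / (k - 1), for k >= 2."""
    r = c0 * math.log(k) / (k - 1)
    return math.sqrt(r) + r

def iwal1_threshold(k, c0, err):
    """Threshold (23): the square-root term is scaled by the error estimate e_{k-1}."""
    r = c0 * math.log(k) / (k - 1)
    return math.sqrt(r * err) + r

# Toy comparison (hypothetical values): a small error estimate tightens the threshold.
t0 = iwal0_threshold(100, 2.0)
t1 = iwal1_threshold(100, 2.0, 0.1)
```

Since the label is queried with probability 1 only when the error difference falls below the threshold, a smaller threshold means fewer forced queries, which is the sense in which (23) is more aggressive when the error estimate is small.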

We implemented these algorithms in Vowpal Wabbit (VW), a fast learning system using online convex optimization, which fits nicely with the streaming active learning setting. We performed experiments on 22 binary classification datasets with varying sizes ($10^3$ to $10^6$) and diverse feature characteristics. Details about the datasets are in the appendix. Our goals are:


1. Investigating the maximal test error improvement per label query achievable by different algorithms; 


2. Comparing different algorithms when each uses the best fixed hyper-parameter setting. 

We thus consider the following experiment setting. To simulate the streaming setting, we randomly permuted the datasets, ran the active learning algorithms through the first 80% of data, and evaluated the learned classifiers on the remaining 20%. We repeated this process 9 times to reduce variance due to random permutation. For each active learning algorithm, we obtain the test error rates of classifiers trained at doubling numbers of label queries starting from 10 to 10240. Formally, let $\mathrm{error}_{a,p}(d, j, q)$ denote the test error of the classifier returned by algorithm $a$ using hyper-parameter setting $p$ on the j-th permutation of dataset $d$ under a label budget of $10 \cdot 2^{q-1}$, $1 \le q \le 11$, and $\mathrm{query}_{a,p}(d, j, q)$ denote the actual number of label queries made. Note that under the same label budget, OAC and the Oracular-CAL variants may use more example-label pairs for learning than IWAL0 and IWAL1 because the former algorithms use predicted labels. Also note that $\mathrm{query}_{a,p}(d, j, q) < 10 \cdot 2^{q-1}$ when algorithm $a$ reaches the end of the training data before hitting the q-th label budget.

Footnote: http://hunch.net/~vw/ (Vowpal Wabbit)

Table 1: Summary of performance metrics

            OAC     IWAL0   IWAL1   ORA-IWAL0  ORA-IWAL1  PASSIVE
AUC-GAIN*   0.1611  0.1466  0.1552  0.1586     0.1549     0.0950
AUC-GAIN    0.0722  0.0863  0.0755  0.0945     0.0807     0.0718

To evaluate the overall performance of an algorithm, we consider the area under its curve of test error against log number of label queries:


$$\mathrm{AUC}_{a,p}(d, j) = \frac{1}{2}\sum_{q=1}^{10}\left(\mathrm{error}_{a,p}(d, j, q+1) + \mathrm{error}_{a,p}(d, j, q)\right)\cdot\log_2\left(\frac{\mathrm{query}_{a,p}(d, j, q+1)}{\mathrm{query}_{a,p}(d, j, q)}\right). \tag{24}$$


A good active learning algorithm has a small value of AUC, which indicates that the test error decreases quickly as 
the number of label queries increases. We use a logarithmic scale for the number of label queries to focus on the 
performance under few label queries where active learning is the most relevant. More details about hyper-parameters are in Appendix G.2.
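The area in (24) is a trapezoidal sum on a base-2 logarithmic query axis. A minimal sketch, with a hypothetical function name and toy inputs:

```python
import math

def auc_log_queries(errors, queries):
    """Trapezoidal area of test error against log2(number of label queries), as in (24).

    errors[q] and queries[q] are the test error and the actual number of label
    queries at the (q+1)-th label budget.
    """
    return 0.5 * sum(
        (errors[q + 1] + errors[q]) * math.log2(queries[q + 1] / queries[q])
        for q in range(len(errors) - 1)
    )

# The 11 label budgets 10 * 2^(q-1), q = 1, ..., 11.
budgets = [10 * 2 ** (q - 1) for q in range(1, 12)]
# Toy curve (hypothetical): a constant error of 0.25 with every budget fully used.
flat_auc = auc_log_queries([0.25] * 11, budgets)
two_point_auc = auc_log_queries([0.5, 0.3], [10, 20])
```

When every budget is fully used, each of the 10 intervals has width $\log_2 2 = 1$, so a constant error $e$ gives an area of $10e$.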

For the first goal, we compare the performances of different algorithms optimized on a per dataset basis. More specifically, we measure the performance of algorithm $a$ by the following aggregated metric:

$$\text{AUC-GAIN}^*(a) := \underset{d}{\mathrm{mean}}\ \max_{p}\ \underset{1 \le j \le 9}{\mathrm{median}}\left\{\frac{\mathrm{AUC}_{base}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{base}(d, j)}\right\}, \tag{25}$$

where $\mathrm{AUC}_{base}$ denotes the AUC of PASSIVE using a default hyper-parameter setting, corresponding to a learning rate of 0.4 (see Appendix G.2 for more details). In this metric, we first take the median of the relative test error improvements over the PASSIVE baseline, which gives a representative performance among the 9 random permutations, and then take the maximum of the medians over hyper-parameters, and finally average over datasets. This metric shows the maximal gain each algorithm achieves with the best hyper-parameter setting for each dataset.

In practice it is difficult to select active learning hyper-parameters on a per-dataset basis because labeled validation data are not available. With a variety of classification datasets, a reasonable alternative might be to look for the single hyper-parameter setting that performs the best on average across datasets, thereby reducing over-fitting to any individual dataset, and compare different algorithms under such fixed parameter settings. We thus consider the following metric:

$$\text{AUC-GAIN}(a) := \max_{p}\ \underset{d}{\mathrm{mean}}\ \underset{1 \le j \le 9}{\mathrm{median}}\left\{\frac{\mathrm{AUC}_{base}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{base}(d, j)}\right\}, \tag{26}$$

which first averages the median improvements over datasets and then maximizes over hyper-parameter settings.
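The two aggregates (25) and (26) differ only in the order of the maximum over hyper-parameters and the mean over datasets. The sketch below assumes a hypothetical nested dictionary `gains[dataset][setting]` holding the per-permutation relative AUC improvements; the names and toy numbers are illustrative and are not results from the paper.

```python
from statistics import mean, median

def auc_gain_star(gains):
    """(25): mean over datasets of the best (over settings) median improvement."""
    return mean([max(median(perms) for perms in by_setting.values())
                 for by_setting in gains.values()])

def auc_gain(gains):
    """(26): best (over settings) mean over datasets of the median improvement."""
    settings = list(next(iter(gains.values())).keys())
    return max(mean([median(gains[d][p]) for d in gains]) for p in settings)

# Toy input (hypothetical): two datasets, two settings, three permutations each.
gains = {
    "d1": {"p1": [0.1, 0.2, 0.3], "p2": [0.0, 0.05, 0.1]},
    "d2": {"p1": [0.0, 0.0, 0.1], "p2": [0.2, 0.3, 0.4]},
}
star = auc_gain_star(gains)
fixed = auc_gain(gains)
```

Because a per-dataset maximum can never be smaller than the value of any single fixed setting, the first aggregate is always at least as large as the second, which matches the ordering of the two rows of Table 1.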


6.3 Results and Discussions 


Table 1 gives a summary of the performances of different algorithms, measured by the two metrics AUC-GAIN* (25) and AUC-GAIN (26). When using hyper-parameters optimized on a per-dataset basis (top row in Table 1), OAC achieves the largest improvement over the PASSIVE baseline, with ORA-IWAL0 achieving almost the same improvement and other active learning algorithms improving slightly less. When using the best fixed hyper-parameter setting across all datasets (bottom row in Table 1), all active learning algorithms achieve less improvement compared with PASSIVE, which achieves a 7% improvement with the best fixed learning rate. ORA-IWAL0 performs the best, achieving a 9% improvement, while IWAL0 and ORA-IWAL1 achieve more than 8%. Both IWAL1 and OAC achieve around 7.5% improvements, slightly better than PASSIVE. This suggests that careful tuning of hyper-parameters is critical for OAC and an important direction for future work.

To describe the behaviors of different algorithms in more detail, we plot the relative improvement in test error against number of label queries. In Figure 1(a), for each algorithm $a$ we identify the best fixed hyper-parameter setting

$$p^* := \arg\max_{p}\ \underset{d}{\mathrm{mean}}\ \underset{1 \le j \le 9}{\mathrm{median}}\left\{\frac{\mathrm{AUC}_{base}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{base}(d, j)}\right\}, \tag{27}$$

and plot the relative test error improvement by $a$ using $p^*$ averaged across all datasets at the 11 label budgets:

$$\left\{\left(10 \cdot 2^{q-1},\ \underset{d}{\mathrm{mean}}\ \underset{1 \le j \le 9}{\mathrm{median}}\left\{\frac{\mathrm{error}_{base}(d, j, q) - \mathrm{error}_{a,p^*}(d, j, q)}{\mathrm{error}_{base}(d, j, q)}\right\}\right)\right\}_{q=1}^{11}. \tag{28}$$

Figure 1: Relative improvement in test error vs. number of label queries under the best fixed hyper-parameter setting across datasets. Results are averaged over all datasets. Panels: (a) median over permutations; (b) the first permutation; (c) three quartiles over permutations.

Figure 2: Relative improvement in test error vs. number of label queries under the hyper-parameter settings optimized on a per dataset basis. Results are averaged over all datasets. Panels: (a) median over permutations; (b) the first permutation; (c) three quartiles over permutations.


The two IWAL algorithms start off badly at small numbers of label queries, but outperform other algorithms after 
100-or-so label queries. OAC performs better than the two Oracular-CAL algorithms until a few hundred label queries, 
but becomes worse afterwards. 

To give a sense of the variation due to random permutation, we plot in Figure 1(b) average results on the first permutation of each dataset, i.e., instead of taking the median in (28), we simply took results from the first permutation. Figures 1(a) and 1(b) suggest that variation due to permuting the data is quite large, especially for the two Oracular-CAL algorithms and IWAL1. Figure 1(c) gives another view that shows variation for OAC, ORA-IWAL0, and PASSIVE: in addition to the median improvement, we also plot error bars corresponding to the first and the third quartiles of the relative improvement over random permutations, i.e., (28) with median replaced by the two quartiles, respectively.

In Figures 2(a) to 2(c) we plot results obtained by each algorithm $a$ using the best hyper-parameter setting for each dataset $d$:

$$p_d^* := \arg\max_{p}\ \underset{1 \le j \le 9}{\mathrm{median}}\left\{\frac{\mathrm{AUC}_{base}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{base}(d, j)}\right\}. \tag{29}$$

As expected, all algorithms perform better by using the best hyper-parameter setting for each dataset. Note that OAC performs the best at small numbers of label queries, but after a few hundred label queries all active learning algorithms perform quite similarly.

Figure 3: Test error under the best hyper-parameter setting for each dataset vs. number of label queries


Finally in Figure 3 we show the test error rates obtained by OAC, ORA-IWAL0, and PASSIVE against number of label queries for 2 of the 22 datasets, using the best hyper-parameter setting for each dataset. Results for all datasets and all algorithms are in Appendix G.3.

In sum, when using the best fixed hyper-parameter setting, ORA-IWAL0 outperforms other active learning algorithms. When using the best hyper-parameter setting tuned for each dataset, OAC and ORA-IWAL0 perform equally well and better than other algorithms.


7 Analysis of generalization ability 

In this section we present the main framework and analysis for the results on the generalization properties of the Active Cover algorithm. Our analysis is broken up into several steps. We start by setting up some additional notation for the proofs. Our analysis relies on two deviation bounds for the empirical regret and the empirical error of the ERM classifier. These are obtained by appropriately applying Freedman-style concentration bounds for martingales. Both these bounds depend on the variance and range of our error and regret estimates for all classifiers $h \in \mathcal{H}$, and these quantities are controlled using the constraints (5) and (6) in the definition of the optimization problem (OP). Since our data consists of examples from different epochs, which use different query probabilities $P_m$, the above steps with appropriate manipulations yield bounds for the epoch $m$, in terms of various quantities involving the previous epochs. Theorem 1 and its corollaries are then obtained by setting up appropriate inductive claims. We make this intuition precise in the following sections.


7.1 Framework for generalization analysis 

Before we can prove our main results, we recall some notations and introduce a few additional ones. We also prove 
some technical lemmas in this section which are used to prove our main results. 

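Recall the notation $\mathrm{reg}(h, h') := \mathrm{err}(h) - \mathrm{err}(h')$, $h^* \in \arg\min_{h \in \mathcal{H}}\mathrm{err}(h)$, $\mathrm{reg}(h) := \mathrm{reg}(h, h^*)$. Let $Z_m$ denote the set of importance-weighted examples in $\tilde{Z}_m$, and the corresponding empirical error is denoted as:

$$\mathrm{err}(h, Z_m) := \frac{1}{\tau_m}\sum_{j=1}^{m}\sum_{i=\tau_{j-1}+1}^{\tau_j}\frac{Q_i\,\mathbf{1}(h(X_i) \ne Y_i \wedge X_i \in D_j)}{P_j(X_i)}. \tag{30}$$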


Taking expectations, we define the following quantities with respect to the sequence of regions $\{D_m\}$:

$$\mathrm{err}_m(h) := \mathbb{E}_{X,Y}[\mathbf{1}(h(X) \ne Y \wedge X \in D_m)], \qquad \overline{\mathrm{err}}_m(h) := \frac{1}{\tau_m}\sum_{j=1}^{m}(\tau_j - \tau_{j-1})\,\mathrm{err}_j(h). \tag{31}$$

Intuitively, $\mathrm{err}_m$ captures the population error of $h$, restricted to only the examples in the disagreement region. This is also the expectation of the sample error restricted to the importance-weighted examples in epoch $m$. Averaging these quantities, we obtain $\overline{\mathrm{err}}_m$, which is the expectation of the sample error over $Z_m$. Centering around the corresponding errors of $h^*$, we obtain the following regret terms:

$$\mathrm{reg}_m(h) := \mathrm{err}_m(h) - \mathrm{err}_m(h^*), \qquad \overline{\mathrm{reg}}_m(h) := \frac{1}{\tau_m}\sum_{j=1}^{m}(\tau_j - \tau_{j-1})\,\mathrm{reg}_j(h).$$

While the above quantities only concern the importance-weighted examples, it is also useful to measure error and regret terms over the entire biased sample. We define the empirical error and regret on $\tilde{Z}_m$ as follows:

$$\mathrm{err}(h, \tilde{Z}_m) := \frac{1}{\tau_m}\sum_{j=1}^{m}\sum_{i=\tau_{j-1}+1}^{\tau_j}\left(\mathbf{1}(h(X_i) \ne h_j(X_i) \wedge X_i \notin D_j) + \frac{Q_i\,\mathbf{1}(h(X_i) \ne Y_i \wedge X_i \in D_j)}{P_j(X_i)}\right),$$

$$\mathrm{reg}(h, h', \tilde{Z}_m) := \mathrm{err}(h, \tilde{Z}_m) - \mathrm{err}(h', \tilde{Z}_m),$$

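and the associated expected regret:

$$\widetilde{\mathrm{reg}}_m(h, h') := \mathbb{E}_X\left[\left(\mathbf{1}(h(X) \ne h_m(X)) - \mathbf{1}(h'(X) \ne h_m(X))\right)\mathbf{1}(X \notin D_m)\right] + \mathbb{E}_{X,Y}\left[\left(\mathbf{1}(h(X) \ne Y) - \mathbf{1}(h'(X) \ne Y)\right)\mathbf{1}(X \in D_m)\right], \tag{32}$$

$$\overline{\widetilde{\mathrm{reg}}}_m(h, h') := \frac{1}{\tau_m}\sum_{j=1}^{m}(\tau_j - \tau_{j-1})\,\widetilde{\mathrm{reg}}_j(h, h'). \tag{33}$$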

The quantity $\overline{\widetilde{\mathrm{reg}}}_m(h, h')$ will play quite a central role in our analysis as it is the expectation of the empirical regret of $h$ relative to $h'$ on our biased sample $\tilde{Z}_m$. We also recall the earlier notations

$$\Delta_m := c_1\sqrt{\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde{Z}_m)} + c_2\,\epsilon_m\log\tau_m,$$

$$A_{m+1} := \{h \in \mathcal{H} \mid \mathrm{err}(h, \tilde{Z}_m) - \mathrm{err}(h_{m+1}, \tilde{Z}_m) \le \gamma\Delta_m\}, \text{ and}$$

$$\Delta_m^* := \begin{cases} c_1\sqrt{\epsilon_m\,\overline{\mathrm{err}}_m(h^*)} + c_2\,\epsilon_m\log\tau_m, & m \ge 1, \\ \Delta_0, & m = 0. \end{cases}$$

Unless stated otherwise, we adopt the convention that in the quantities defined above, summations from 1 to $m$ take the value of zero when $m = 0$. We use the shorthand $m(i)$ to denote the epoch containing example $i$. We also sometimes use the shorthand $\mathrm{reg}(h, \tilde{Z}_m) := \mathrm{reg}(h, h_{m+1}, \tilde{Z}_m)$, $\widetilde{\mathrm{reg}}_m(h) := \widetilde{\mathrm{reg}}_m(h, h^*)$, and $\overline{\widetilde{\mathrm{reg}}}_m(h) := \overline{\widetilde{\mathrm{reg}}}_m(h, h^*)$.

With the notations in place, we start with an extremely important lemma, which shows that the biased sample $\tilde{Z}$ which we create introduces a bias in the favor of good hypotheses, overly penalizing the bad hypotheses while favorably evaluating the optimal $h^*$.

Lemma 1 (Favorable Bias). $\forall m \ge 1$, $\forall h \in A_m$, $\forall h' \in \mathcal{H}$, the following holds:

$$\widetilde{\mathrm{reg}}_m(h', h) \ge \mathrm{reg}(h', h).$$

The next key ingredient for our proofs is a deviation bound, which will be appropriately used to control the deviation of the empirical regret and error terms.

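Lemma 2 (Deviation Bounds). Pick $0 < \delta < 1/e$ such that $|\mathcal{H}|/\delta > \sqrt{192}$. With probability at least $1 - \delta$ the following holds. For all $(h, h') \in \mathcal{H}^2$ and $\forall m \ge 1$,

$$\left|\overline{\widetilde{\mathrm{reg}}}_m(h, h') - \mathrm{reg}(h, h', \tilde{Z}_m)\right| \le \sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_X\left[\left(\mathbf{1}(X \notin D_i) + \frac{\mathbf{1}(X \in D_i)}{P_i(X)}\right)\mathbf{1}(h(X) \ne h'(X))\right]} + \frac{\epsilon_m}{P_{\min,m}}, \tag{34}$$

$$\left|\mathrm{err}(h, Z_m) - \overline{\mathrm{err}}_m(h)\right| \le \sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_{X,Y}\left[\frac{\mathbf{1}(X \in D_i \wedge h(X) \ne Y)}{P_i(X)}\right]} + \frac{\epsilon_m}{P_{\min,m}}, \tag{35}$$

where

$$\epsilon_m := \frac{32\left(\log(|\mathcal{H}|/\delta) + \log\tau_m\right)}{\tau_m}.$$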

The lemma is obtained by applying a form of Freedman's inequality presented in Appendix A. Intuitively, the deviations are small so long as the average importance weights over the disagreement region and the minimum query probability over the disagreement region are well-behaved. This lemma also highlights why $\overline{\widetilde{\mathrm{reg}}}_m$ is a very natural quantity for our analysis, since the empirical regret on our biased sample $\tilde{Z}$ concentrates around it.

To keep the handling of probabilities simple, we assume for the bulk of this section that the conclusions of Lemma 2 hold deterministically. The failure probability is handled once at the end to establish our main results. Let $\mathcal{E}$ denote the event that the assertions of Lemma 2 hold deterministically, and we know that $\Pr(\mathcal{E}^C) \le \delta$. Based on the above lemma, we obtain the following propositions for the concentration of empirical regret and error terms.


Proposition 1 (Regret concentration). Fix an epoch $m \ge 1$. Suppose the event $\mathcal{E}$ holds and assume that $h^* \in A_j$ for all epochs $j \le m$. Then


|reg (h,h*,Z m ) ~ ™g m {h,h*)\ 


< 4 r eg m W + 2 a 


\ 


— ~ r»_i)reg i (/ii) + 2a/3eiT m (/i»)f 




2ye m \n - T»_i)(reg(/i, A_i) + reg (ft*, Yi)) + 4A„ 


We need an analogous result for the empirical error of the ERM at each epoch. 

Proposition 2 (Error concentration). Fix an epoch $m \ge 1$. Suppose the event $\mathcal{E}$ holds and assume that $h^* \in A_j$ for all epochs $j \le m$. Then

|en m (ft ) eri(ft m _|_i, Z m )\ A ^ T ^ -t - reg (ft ; ftm+i; Z rn j. 


We now present the proofs of our main results based on these propositions.

7.2 Proofs of main results


We prove a more general version of the theorem. Theorem 1 and its corollaries follow as consequences of this more general result.

Theorem 5. For all epochs $m = 1, 2, \ldots, M$ and all $h \in \mathcal{H}$, the following holds with probability at least $1 - \delta$:


|reg(/i, h*, Z m ) — reg m (/i, h*)\ < 

, Z m ) ^ 

|efr m (h*) - err(/i m+1 , Z m )\ < 


7 , reg m (h,h*) + | A m , 




and h* £ A, , 


+ 1 A m . 
2 2 


(36) 

(37) 

(38) 


The theorem is proved inductively. We first give the proof outline for this theorem, and then show how Theorem 1 and its corollaries follow.


7.2.1 Proof of Theorem 5

The theorem is proved via induction. Let us start with the base case for $m = 1$. Clearly, $A_1 = \mathcal{H} \ni h^*$, and

$$\left|\mathrm{reg}(h, h^*, \tilde{Z}_1) - \overline{\widetilde{\mathrm{reg}}}_1(h, h^*)\right| \le 1 \le \eta\Delta_1/4,$$

since $P_{\min,1} = 1$. The conclusions for the second and third statements follow similarly. This establishes the base case. Let us now assume that the hypothesis holds for $i = 1, 2, \ldots, m - 1$ and we establish it for the epoch $i = m$. We start from the conclusion of Proposition 1, which yields


|reg(/i, h*, Z m ) — reg m (/i, h*)\ 
1 


< -reg m {h) + 2a. — - r i _i)reg i (/i») + 2a\/3err m (/i*)e m 

. '- r a -- 


-v— 

7i 




276mA m - r i _ 1 )(reg(ft, Z ,_ L ) + reg(/i*, Z^ff) +4A„ 


r 3 


We now control $T_1$, $T_2$ and $T_3$ in the sum using our inductive hypothesis and the propositions in a series of lemmas.
To state the lemmas cleanly, let $\mathcal{E}_m$ refer to the event where the bounds (36)-(38) hold at epoch $m$. Then we have the
following lemmas. The first lemma gives a bound on $T_1$.

Lemma 3. Suppose that the event $\mathcal{E}$ holds and that the events $\mathcal{E}_i$ hold for all epochs $i = 1, 2, \ldots, m-1$. Then we
have

$$T_1 = 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)} \le \frac{\eta\Delta_m}{12} + 24\alpha^2\epsilon_m\log\tau_m.$$


Intuitively, the lemma holds since Lemma 1 allows us to bound $\mathrm{reg}_i(h_i)$ with $\widetilde{\mathrm{reg}}_{i-1}(h_i, h^*)$. The latter is then
controlled using the event $\mathcal{E}_{i-1}$. Some algebraic manipulations then yield the lemma, with a detailed proof in Appendix C.
We next present a lemma that helps us control $T_2$.

Lemma 4. Suppose that the event $\mathcal{E}$ holds and that the events $\mathcal{E}_i$ hold for all epochs $i = 1, 2, \ldots, m-1$. Then we
have

$$T_2 = 2\alpha\sqrt{3\,\overline{\mathrm{err}}_m(h^*)\,\epsilon_m} \le 2\alpha\sqrt{6\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + \Delta_m + \frac{1}{4}\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m) + 33\alpha^2\epsilon_m.$$


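The lemma follows more or less directly from Proposition 2 combined with some algebra. Finally, we present a
lemma to bound $T_3$.

Lemma 5. Suppose that the event $\mathcal{E}$ holds and that the events $\mathcal{E}_i$ hold for all epochs $i = 1, 2, \ldots, m-1$. Then we
have

$$T_3 = \beta\sqrt{2\gamma\epsilon_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big)} \le \frac{1}{4}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{7\eta\Delta_m}{72}.$$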


The $\mathrm{reg}(h^*, h_i, \tilde Z_{i-1})$ terms in the lemma are bounded directly due to the event $\mathcal{E}_{i-1}$. For the second term, we
observe that the empirical regret of $h$ relative to $h_i$ is not too different from the empirical regret to $h^*$ (since $h^*$ has a
small empirical regret by $\mathcal{E}_{i-1}$). Furthermore, the empirical regret to $h^*$ is close to $\widetilde{\mathrm{reg}}_{i-1}(h, h^*)$ by the event $\mathcal{E}_{i-1}$. These
observations, along with some technical manipulations, yield the lemma.

Given these lemmas, we can now prove the theorem in a relatively straightforward manner. Given our inductive
hypothesis, the events $\mathcal{E}_i$ indeed hold for all epochs $i = 1, 2, \ldots, m-1$, which allows us to invoke the lemmas.
Substituting the above bounds on $T_1$ from Lemma 3, $T_2$ from Lemma 4 and $T_3$ from Lemma 5 into Proposition 1 yields


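$$\begin{aligned}
|\mathrm{reg}(h, h^*, \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h^*)|
&\le \frac{1}{4}\widetilde{\mathrm{reg}}_m(h) + \frac{\eta\Delta_m}{12} + 24\alpha^2\epsilon_m\log\tau_m + 2\alpha\sqrt{6\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + \Delta_m \\
&\quad + \frac{1}{4}\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m) + 33\alpha^2\epsilon_m + \frac{1}{4}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{7\eta\Delta_m}{72} + 4\Delta_m \\
&\le \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + 57\alpha^2\epsilon_m\log\tau_m + \frac{13\eta\Delta_m}{72} + 2\alpha\sqrt{6\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + 5\Delta_m \\
&\quad + \frac{1}{4}\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m).
\end{aligned}$$

Further recalling that $c_1 \ge 2\alpha\sqrt{6}$ and $c_2 \ge 57\alpha^2$ by our assumptions on constants, we obtain

$$|\mathrm{reg}(h, h^*, \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h^*)| \le \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{13}{72}\eta\Delta_m + 6\Delta_m + \frac{1}{4}\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m). \qquad (39)$$

To complete the proof of the bound (36), we now substitute $h = h_{m+1}$ in the above bound, which yields

$$\frac{1}{2}\widetilde{\mathrm{reg}}_m(h_{m+1}, h^*) - \frac{5}{4}\mathrm{reg}(h_{m+1}, h^*, \tilde Z_m) \le \frac{13}{72}\eta\Delta_m + 6\Delta_m.$$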

Since $h^* \in A_i$ for all epochs $i \le m$, we have $\widetilde{\mathrm{reg}}_m(h, h^*) \ge \mathrm{reg}(h, h^*) \ge 0$ for all classifiers $h \in \mathcal{H}$. Consequently,
we see that

$$\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m) = -\mathrm{reg}(h_{m+1}, h^*, \tilde Z_m) \le \frac{52\eta}{360}\Delta_m + \frac{24}{5}\Delta_m \le \frac{\eta}{4}\Delta_m, \qquad (40)$$

where the last inequality uses the condition $38\eta \ge 1728$. We can now substitute this back into our earlier bound (39)
and obtain

$$|\mathrm{reg}(h, h^*, \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h^*)| \le \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{13}{72}\eta\Delta_m + 6\Delta_m + \frac{\eta}{16}\Delta_m \le \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{\eta}{4}\Delta_m,$$

where we use the condition $\eta/144 \ge 6$. This completes the proof of the first part of our inductive claim.
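The two constant conditions used in this step are plain arithmetic, so they can be checked mechanically. The sketch below is only a sanity check and not part of the analysis: it assumes the coefficients read as 52/360 and 24/5 in (40), and 13/72, 6 and 1/16 in the final bound; the function name is ours.

```python
from fractions import Fraction


def check_inductive_constants(eta):
    """Return (bound_40_ok, bound_36_ok) for a given value of eta.

    bound_40_ok: (52/360)*eta + 24/5 <= eta/4, equivalent to 38*eta >= 1728.
    bound_36_ok: (13/72)*eta + 6 + eta/16 <= eta/4, equivalent to eta/144 >= 6.
    Exact rational arithmetic avoids floating-point ties at the thresholds.
    """
    eta = Fraction(eta)
    bound_40_ok = Fraction(52, 360) * eta + Fraction(24, 5) <= eta / 4
    bound_36_ok = Fraction(13, 72) * eta + 6 + eta / 16 <= eta / 4
    return bound_40_ok, bound_36_ok


if __name__ == "__main__":
    for eta in (45, 46, 863, 864):
        print(eta, check_inductive_constants(eta))
```

Under these assumed coefficients, the second condition is the binding one: it first holds at $\eta = 864$, which also satisfies $38\eta \ge 1728$.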

For the second part, this is almost a byproduct of the first part through Equation (40). Recalling that $\gamma \ge \eta/4$ by
assumption, this ensures that $h^* \in A_{m+1}$.

We next establish the third part of the claim. This is obtained by combining our bound (40) with Proposition 2.
We have

$$\begin{aligned}
|\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h_{m+1}, \tilde Z_m)|
&\le \frac{\overline{\mathrm{err}}_m(h^*)}{2} + \frac{3\Delta_m}{2} + \mathrm{reg}(h^*, h_{m+1}, \tilde Z_m) \\
&\le \frac{\overline{\mathrm{err}}_m(h^*)}{2} + \frac{3\Delta_m}{2} + \frac{\eta\Delta_m}{4} \\
&\le \frac{\overline{\mathrm{err}}_m(h^*)}{2} + \frac{\eta\Delta_m}{2},
\end{aligned}$$

since $\eta \ge 6$. This completes the third part.

Finally, note that our analysis has been conditioned on the event $\mathcal{E}$ so far. By Lemma 2, $\Pr(\mathcal{E}^c) \le \delta$, which
completes the proof of the theorem.

We now provide a proof for Theorem 1.


7.2.2 Proof of Theorem 1

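We only prove the first part of the theorem. The second part is simply a restatement of the inequality (37) in Theorem 5.
The first part is essentially a restatement of (36) in Theorem 5, except the bound uses $\Delta_m^*$ instead of $\Delta_m$. In order to
prove the theorem, pick any epoch $m \le M$ and $h \in A_{m+1}$. Because $h^* \in A_j$, $1 \le j \le m+1$, we have by Lemma 1
that

$$\mathrm{reg}(h) \le \widetilde{\mathrm{reg}}_m(h, h^*).$$

It then suffices to bound $\widetilde{\mathrm{reg}}_m(h, h^*)$. By the deviation bound (36), we have

$$\begin{aligned}
\widetilde{\mathrm{reg}}_m(h, h^*)
&\le \mathrm{reg}(h, h^*, \tilde Z_m) + \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{\eta}{4}\Delta_m \\
&= \mathrm{reg}(h, h_{m+1}, \tilde Z_m) - \mathrm{reg}(h^*, h_{m+1}, \tilde Z_m) + \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{\eta}{4}\Delta_m \\
&\le \frac{1}{2}\widetilde{\mathrm{reg}}_m(h, h^*) + \Big(\gamma + \frac{\eta}{4}\Big)\Delta_m.
\end{aligned}$$

Rearranging terms leads to

$$\widetilde{\mathrm{reg}}_m(h, h^*) \le 4\gamma\Delta_m$$

because $\gamma \ge \eta/4$. Now we show that $\Delta_m \le 4\Delta_m^*$, which leads to the desired result. It is trivially true for $m = 1$
because $\Delta_1^* = \Delta_1$. For $m \ge 2$, by the deviation bound on the empirical error (38) we have

$$\begin{aligned}
\Delta_m &= c_1\sqrt{\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + c_2\epsilon_m\log\tau_m \\
&\le c_1\sqrt{\epsilon_m\Big(\frac{3}{2}\overline{\mathrm{err}}_m(h^*) + \frac{\eta}{2}\Delta_m\Big)} + c_2\epsilon_m\log\tau_m \\
&\le 2c_1\sqrt{\epsilon_m\,\overline{\mathrm{err}}_m(h^*)} + \sqrt{\frac{c_1^2\eta\epsilon_m}{2}\Delta_m} + c_2\epsilon_m\log\tau_m \\
&\le 2c_1\sqrt{\epsilon_m\,\overline{\mathrm{err}}_m(h^*)} + \frac{\Delta_m}{2} + \frac{c_1^2\eta\epsilon_m}{4} + c_2\epsilon_m\log\tau_m \\
&\le 2\Delta_m^* + \frac{\Delta_m}{2},
\end{aligned}$$

where the last inequality uses our choice of constants $c_1^2\eta/4 \le c_2$. Rearranging terms completes the proof.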


8 Conclusion 


In this paper, we proposed a new algorithm for agnostic active learning in a streaming setting. The algorithm has strong 
theoretical guarantees, maintaining good generalization properties while attaining a low label complexity in favorable 
settings. Specifically, we show that the algorithm has an optimal performance in a disagreement-based analysis of 
label complexity, as well as in special cases such as realizable problems and under Tsybakov's low-noise condition.
Additionally, we present an example that highlights the structural difference between our algorithm and
some predecessors in terms of label complexity. Indeed, a key improvement of our algorithm is that we do not always
need to query over the entire disagreement region, a limitation of most computationally efficient predecessors. This is
achieved through a careful construction of an optimization problem defining good query probability functions, which 
relies on using refined data-dependent error estimates. 

We complement our theoretical analysis with an extensive empirical evaluation of several approaches across a 
suite of 22 datasets. The experiments show both the pros and cons of our proposed method, which performs well 
when hyperparameter tuning is allowed, but suffers from lack of robustness when we fix these hyperparameters across 
datasets. To our knowledge, such a comprehensive empirical evaluation on a range of diverse datasets has not previously been done for
agnostic active learning algorithms, and it is a key contribution of this work.

We believe that our work naturally leads to several interesting directions for future research. As the example in
Section 4.2.2 reveals, the worst-case label complexity analysis in Theorem 2 is rather pessimistic. It would be
interesting to obtain a sharper characterization of the label complexity by exploiting the structure of the query probability
function over the disagreement region. This would likely involve understanding more fine-grained properties that
make a problem easy or hard for active learning beyond the disagreement coefficient, and such a development might
also lead to better algorithms. A limitation of the current theory is the somewhat poor dependence in Theorem 4 on
the number of unlabeled examples needed to solve the optimization problem. Ideally, we would like to be able to use
$\tilde O(\tau_m)$ unlabeled examples to solve (OP) at epoch $m$, and improving this dependence is perhaps the most important
direction for future work. Finally, while AC is extremely attractive from a theoretical standpoint, a direct
implementation still seems somewhat impractical. Obtaining theory for an algorithm even closer to the practical variant OAC
would be an important step in bringing the theory and implementation closer.


Acknowledgements 

The authors would like to thank Kamalika Chaudhuri for helpful initial discussions. 


References 

Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. 
In Conference on Learning Theory, pages 288-316, 2013. 











Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd 
international conference on Machine learning, pages 65-72. ACM, 2006. 

Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th 
annual conference on Learning theory, pages 35-50. Springer-Verlag, 2007. 

P. Bartlett and S. Mendelson. Gaussian and Rademacher complexities: Risk bounds and structural results. Journal of 
Machine Learning Research, 3:463-482, 2002. 

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009. 

A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010. 

Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross- 
validation. In Proceedings of the twelfth annual conference on Computational learning theory, pages 203-208. 
ACM, 1999. 

R. M. Castro and R.D. Nowak. Minimax bounds for active learning. Information Theory, IEEE Transactions on, 54 
(5):2339 -2353, 2008. 

Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. 
Information Theory, IEEE Transactions on, 50(9):2050-2057, 2004. 

D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 
1994. 

S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing 
Systems 18, 2005. 

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007. 

D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, February 1975.

S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009. 

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):
131-309, 2014.

D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. J. Amer. 
Statist. Assoc., 47:663-685, 1952. ISSN 0162-1459. 

Daniel J. Hsu. Algorithms for Active Learning. PhD thesis, University of California at San Diego, 2010.

S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In 
Advances in Neural Information Processing Systems 21, 2009. 

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin 
bounds, and regularization. In Advances in neural information processing systems, pages 793-800, 2009. 

Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI 2011, Proceedings of
the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011, pages
392-399, 2011.

Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. 
Res., 11:2457-2485, December 2010. 

A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135-166, 2004. 

Chicheng Zhang. A simplified treatment of oracular CAL. Personal communication, 2015. 

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in 
Neural Information Processing Systems, pages 442-450, 2014. 





A Deviation bound 


We use an adaptation of Freedman's inequality [Freedman, 1975] as the main concentration tool.


Lemma 6. Let $X_1, X_2, \ldots, X_n$ be a martingale difference sequence adapted to the filtration $\mathcal{F}_i$. Suppose there exists
a function $b_n$ of $X_1, \ldots, X_n$ that satisfies

$$\forall\, 1 \le i \le n, \quad |X_i| \le b_n, \qquad 1 \le b_n \le b_{\max},$$

where $b_{\max}$ is a non-random quantity that may depend on $n$. Define

$$S_n := \sum_{i=1}^{n} X_i, \qquad V_n := \sum_{i=1}^{n} \mathbb{E}[X_i^2 \mid \mathcal{F}_{i-1}].$$

Pick any $0 < \delta < 1/e^2$ and $n \ge 3$. We have

$$\Pr\Big(S_n \ge 2\sqrt{V_n\log(1/\delta)} + 3b_n\log(1/\delta)\Big) \le 4\sqrt{\delta}\,(2 + \log_2 b_{\max})\log n.$$


Proof. Define $r_j := 2^j$ for $-1 \le j \le m := \lceil \log_2 b_{\max} \rceil$. Then we have

$$\begin{aligned}
\Pr\Big(S_n \ge 2\sqrt{V_n\log(1/\delta)} + 3b_n\log(1/\delta)\Big)
&= \sum_{j=0}^{m}\Pr\Big(S_n \ge 2\sqrt{V_n\log(1/\delta)} + 3b_n\log(1/\delta) \,\wedge\, r_{j-1} < b_n \le r_j\Big) \\
&\le \sum_{j=0}^{m}\Pr\Big(S_n \ge 2\sqrt{V_n\log(1/\delta)} + 3r_{j-1}\log(1/\delta) \,\wedge\, b_n \le r_j\Big) \\
&\le \sum_{j=0}^{m}\Pr\Big(S_n \ge 2\sqrt{V_n\frac{\log(1/\delta)}{2}} + 3r_j\frac{\log(1/\delta)}{2} \,\wedge\, b_n \le r_j\Big) \\
&\le \sum_{j=0}^{m} 4(\log n)\sqrt{\delta} \qquad (41) \\
&\le 4\sqrt{\delta}\,(2 + \log_2 b_{\max})\log n,
\end{aligned}$$

where (41) is a direct consequence of Lemma 3 of Kakade and Tewari [2009], and the others result from simple algebra. □

B Auxiliary results for Theorem 1

Before presenting our regret analysis, we first establish several useful results. 

Lemma 7. The threshold defined in 0 and the minimum probability P m in,m defined in 0 satisfy the following for 
all m > 1, 


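$$\tau_{m-1}\Delta_{m-1} \le \tau_m\Delta_m, \qquad (42)$$

$$P_{\min,m} \ge P_{\min,m+1}, \qquad (43)$$

$$\frac{\epsilon_m}{P_{\min,m}} \le \Delta_m. \qquad (44)$$

Proof. Notice that

$$\tau_{m-1}\epsilon_{m-1} = 32(\log(|\mathcal{H}|/\delta) + \log\tau_{m-1}) \le 32(\log(|\mathcal{H}|/\delta) + \log\tau_m) = \tau_m\epsilon_m.$$

We first prove (42). It holds trivially for $m = 1$. For $m \ge 2$ we have

$$\begin{aligned}
\tau_{m-1}\Delta_{m-1}
&= c_1\sqrt{(\tau_{m-1}\epsilon_{m-1})\,\tau_{m-1}\,\mathrm{err}(h_m, \tilde Z_{m-1})} + c_2\tau_{m-1}\epsilon_{m-1}\log\tau_{m-1} \\
&\le c_1\sqrt{(\tau_{m-1}\epsilon_{m-1})\,\tau_{m-1}\,\mathrm{err}(h_{m+1}, \tilde Z_{m-1})} + c_2\tau_{m-1}\epsilon_{m-1}\log\tau_{m-1} \\
&\le c_1\sqrt{(\tau_m\epsilon_m)\,\tau_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + c_2\tau_m\epsilon_m\log\tau_m = \tau_m\Delta_m, \qquad (45)
\end{aligned}$$

where the first inequality is by the fact that $h_m$ minimizes the empirical error on $\tilde Z_{m-1}$ and the second inequality is
by $\tau_{m-1}\epsilon_{m-1} \le \tau_m\epsilon_m$. Then for (43), it is easy to see

$$\sqrt{\frac{\tau_{m-1}\,\mathrm{err}(h_m, \tilde Z_{m-1})}{n\epsilon_M}} + \log\tau_{m-1}
\le \sqrt{\frac{\tau_{m-1}\,\mathrm{err}(h_{m+1}, \tilde Z_{m-1})}{n\epsilon_M}} + \log\tau_{m-1}
\le \sqrt{\frac{\tau_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)}{n\epsilon_M}} + \log\tau_m$$

for $m \ge 1$, implying $P_{\min,m} \ge P_{\min,m+1}$. Finally, to prove (44), we have that

$$\begin{aligned}
\frac{\epsilon_m}{P_{\min,m}} \le \frac{\epsilon_m}{P_{\min,m+1}}
&= \max\left(\frac{\sqrt{\tau_m\epsilon_m^2\,\mathrm{err}(h_{m+1}, \tilde Z_m)/(n\epsilon_M)} + \epsilon_m\log\tau_m}{c_3},\; 2\epsilon_m\right) \\
&\le \max\left(\frac{\sqrt{\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + \epsilon_m\log\tau_m}{c_3},\; 2\epsilon_m\right) \\
&\le \Delta_m,
\end{aligned}$$

where the second inequality is by $\tau_m\epsilon_m \le n\epsilon_M$, and the third inequality is by our choices of $c_1$, $c_2$ and $c_3$. □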
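The monotonicity of $\tau_m\epsilon_m$ that drives this proof is easy to see numerically. The following is a minimal sketch, assuming only the identity $\tau_m\epsilon_m = 32(\log(|\mathcal{H}|/\delta) + \log\tau_m)$ used at the start of the proof; the function names and the example schedule are ours.

```python
import math


def tau_times_eps(tau, num_classifiers, delta):
    """tau_m * eps_m = 32 * (log(|H| / delta) + log(tau_m))."""
    return 32.0 * (math.log(num_classifiers / delta) + math.log(tau))


def is_nondecreasing(values):
    """True if each entry is at least as large as the previous one."""
    return all(a <= b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    taus = [3, 5, 9, 17, 30]  # hypothetical epoch end-points
    products = [tau_times_eps(t, num_classifiers=1000, delta=0.05) for t in taus]
    eps = [p / t for p, t in zip(products, taus)]
    print(is_nondecreasing(products))    # tau_m * eps_m grows with m
    print(is_nondecreasing(eps[::-1]))   # while eps_m itself shrinks
```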

We also need a lemma regarding the epoch schedule. 

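Lemma 8. Let $\tau_m \le \tau_{m+1} \le 2\tau_m$ for all $m \ge 1$. Then we have for all $m \ge 1$,

$$\sum_{i=1}^{m}\frac{\tau_{i+1} - \tau_i}{\tau_i} \le 4\log\tau_{m+1}, \qquad \sum_{i=1}^{m}(\tau_i - \tau_{i-1})\Delta_{i-1} \le 4\tau_m\Delta_m\log\tau_m.$$

Proof. Note that we can rewrite the summation in question as

$$\sum_{i=1}^{m}\frac{\tau_{i+1} - \tau_i}{\tau_i} = \sum_{i=1}^{m}\sum_{j=\tau_i+1}^{\tau_{i+1}}\frac{1}{\tau_i} \le \sum_{i=1}^{m}\sum_{j=\tau_i+1}^{\tau_{i+1}}\frac{2}{\tau_{i+1}},$$

where the inequality uses our assumption on epoch lengths. The summation can then be further bounded as

$$\begin{aligned}
\sum_{i=1}^{m}\frac{\tau_{i+1} - \tau_i}{\tau_i}
&\le \sum_{i=1}^{m}\sum_{j=\tau_i+1}^{\tau_{i+1}}\frac{2}{j} \le 2\sum_{j=1}^{\tau_{m+1}}\frac{1}{j} \\
&\le 2(1 + \log\tau_{m+1}) \qquad (46) \\
&\le 4\log\tau_{m+1},
\end{aligned}$$

where the third inequality is by the bound $\sum_{i=1}^{n} 1/i \le 1 + \log n$, and the final inequality is by $1 \le \log\tau_m$, $m \ge 1$.
To prove the second bound in the lemma, we write

$$\begin{aligned}
\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\Delta_{i-1}
&= \tau_1\Delta_0 + \sum_{i=1}^{m-1}(\tau_{i+1} - \tau_i)\Delta_i \\
&= \tau_1\Delta_0 + \sum_{i=1}^{m-1}\frac{\tau_{i+1} - \tau_i}{\tau_i}\,\tau_i\Delta_i \\
&\le \tau_1\Delta_0 + (2 + 2\log\tau_m)\,\tau_m\Delta_m \\
&\le (2\log\tau_1 - 2)\,\tau_1\Delta_1 + (2 + 2\log\tau_m)\,\tau_m\Delta_m \\
&\le (2\log\tau_m - 2)\,\tau_m\Delta_m + (2 + 2\log\tau_m)\,\tau_m\Delta_m \\
&= 4\tau_m\Delta_m\log\tau_m,
\end{aligned}$$

where the first inequality is by (46) and $\tau_i\Delta_i \le \tau_m\Delta_m$ (Lemma 7), the second inequality is by our choice of $\Delta_0$ and
the fact that $\tau_1\Delta_1 \le 1$, and the third inequality again uses $\tau_i\Delta_i \le \tau_m\Delta_m$. □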
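The first bound of Lemma 8 can be checked numerically on any concrete epoch schedule. This is a minimal sketch, not part of the proof: the list `taus` stands for $\tau_1, \ldots, \tau_{m+1}$, and the two example schedules (doubling, and increments of one) are illustrative choices of ours that satisfy $\tau_i \le \tau_{i+1} \le 2\tau_i$ with $\tau_1 \ge 3$.

```python
import math


def satisfies_schedule(taus):
    """Check tau_i <= tau_{i+1} <= 2 * tau_i for consecutive epoch end-points."""
    return all(cur <= nxt <= 2 * cur for cur, nxt in zip(taus, taus[1:]))


def relative_growth_sum(taus):
    """Left-hand side: sum over i of (tau_{i+1} - tau_i) / tau_i."""
    return sum((nxt - cur) / cur for cur, nxt in zip(taus, taus[1:]))


def lemma8_first_bound(taus):
    """Right-hand side: 4 * log(tau_{m+1}), with tau_{m+1} the last entry."""
    return 4.0 * math.log(taus[-1])


if __name__ == "__main__":
    doubling = [3 * 2 ** k for k in range(11)]
    unit_steps = list(range(3, 200))
    for taus in (doubling, unit_steps):
        assert satisfies_schedule(taus)
        print(relative_growth_sum(taus), "<=", lemma8_first_bound(taus))
```

On the doubling schedule every term of the sum equals one, so the left-hand side is exactly $m$, while the bound grows like $4 m \log 2$.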


C Proofs omitted from Section 7.2


We now provide the proofs of the lemmas and propositions from Section 7.1 that were used in proving Theorem 1.
We start with proofs of Lemmas 1 and 2.

Proof of Lemma 1

Pick any $m \ge 1$, $h \in \mathcal{H}$ and $h' \in A_m$. Note that the definitions of $\overline{\mathrm{reg}}_m(h, h')$ and $\mathrm{reg}(h, h')$ only differ on
$X \notin D_m := \mathrm{DIS}(A_m)$, and $\forall X \notin D_m$, $h'(X) = h_m(X)$. We thus have

$$\begin{aligned}
\overline{\mathrm{reg}}_m(h, h') - \mathrm{reg}(h, h')
&= \mathbb{E}_{X,Y}\Big[\mathbf{1}(X \notin D_m)\Big(\big(\mathbf{1}(h(X) \ne h_m(X)) - \mathbf{1}(h'(X) \ne h_m(X))\big) \\
&\qquad\qquad - \big(\mathbf{1}(h(X) \ne Y) - \mathbf{1}(h'(X) \ne Y)\big)\Big)\Big] \\
&= \mathbb{E}_{X,Y}\Big[\mathbf{1}(X \notin D_m)\Big(\mathbf{1}(h(X) \ne h_m(X)) - \big(\mathbf{1}(h(X) \ne Y) - \mathbf{1}(h_m(X) \ne Y)\big)\Big)\Big].
\end{aligned}$$

The desired result then follows from the inequality that

$$\mathbf{1}(h(X) \ne Y) - \mathbf{1}(h_m(X) \ne Y) \le \mathbf{1}(h(X) \ne h_m(X)).$$

□

Proof of Lemma 2

Our proof strategy is to apply Lemma 6 to establish concentration of properly defined martingale difference
sequences for fixed classifiers $h$, $h'$ and some epoch $m$, and then use a union bound to get the desired statement. First we
look at the concentration of the empirical regret on $\tilde Z_m$. To avoid clutter, we overload our notation so that $D_i = D_{m(i)}$,
$h_i = h_{m(i)}$ and $P_i = P_{m(i)}$ when $i$ is the index of an example rather than a round.

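For any pair of classifiers $h$ and $h'$, we define the random variables for the instantaneous regrets:

$$\begin{aligned}
R_i := {}& \mathbf{1}(X_i \notin D_i)\big(\mathbf{1}(h(X_i) \ne h_i(X_i)) - \mathbf{1}(h'(X_i) \ne h_i(X_i))\big) \\
& + \mathbf{1}(X_i \in D_i)\big(\mathbf{1}(h(X_i) \ne Y_i) - \mathbf{1}(h'(X_i) \ne Y_i)\big)\frac{Q_i}{P_i(X_i)}
\end{aligned}$$

and the associated $\sigma$-fields $\mathcal{F}_i := \sigma(\{X_j, Y_j, Q_j\}_{j=1}^{i})$. We have that $R_i$ is measurable with respect to $\mathcal{F}_i$. Therefore
$R_i - \mathbb{E}[R_i \mid \mathcal{F}_{i-1}]$ forms a martingale difference sequence adapted to the filtrations $\mathcal{F}_i$, $i \ge 1$, and

$$\mathbb{E}[R_i \mid \mathcal{F}_{i-1}] = \overline{\mathrm{reg}}_{m(i)}(h, h')$$

according to (32) and the fact that $X_i, Y_i, Q_i$ are independent from the past. To use Lemma 6, we first identify an
upper bound on elements in the sequence:

$$|R_i - \mathbb{E}[R_i \mid \mathcal{F}_{i-1}]| = |R_i - \overline{\mathrm{reg}}_{m(i)}(h, h')| \le \max\big(R_i, \overline{\mathrm{reg}}_{m(i)}(h, h')\big) \le \frac{1}{P_{\min,m(i)}} \le \frac{1}{P_{\min,m}} \qquad (47)$$

for all $i$ such that $m(i) \le m$, where the last inequality is by Lemma 7. The definition of $P_{\min,m}$ implies that

$$\frac{1}{P_{\min,m}} \le \max\Big(\sqrt{\tau_{m-1}/(n\epsilon_M)} + \log\tau_{m-1},\; 2\Big) \le 2\sqrt{\tau_{m-1}} + 1 \qquad (48)$$

because $n\epsilon_M \ge 1$. Then we consider the conditional second moment. Using the fact that

$$\big(\mathbf{1}(h(X_i) \ne Y_i) - \mathbf{1}(h'(X_i) \ne Y_i)\big)^2 \le \mathbf{1}(h(X_i) \ne h'(X_i)),$$

we get

$$\begin{aligned}
\mathbb{E}\big[(R_i - \mathbb{E}[R_i \mid \mathcal{F}_{i-1}])^2 \mid \mathcal{F}_{i-1}\big]
&= \mathbb{E}\big[(R_i - \overline{\mathrm{reg}}_{m(i)}(h, h'))^2 \mid \mathcal{F}_{i-1}\big] \le \mathbb{E}[R_i^2 \mid \mathcal{F}_{i-1}] \\
&\le \mathbb{E}\Big[\Big(\mathbf{1}(X_i \notin D_i) + \frac{\mathbf{1}(X_i \in D_i)\,Q_i}{P_i(X_i)^2}\Big)\mathbf{1}(h(X_i) \ne h'(X_i)) \,\Big|\, \mathcal{F}_{i-1}\Big] \\
&= \mathbb{E}\Big[\Big(\mathbf{1}(X_i \notin D_i) + \frac{\mathbf{1}(X_i \in D_i)}{P_i(X_i)}\Big)\mathbf{1}(h(X_i) \ne h'(X_i)) \,\Big|\, \mathcal{F}_{i-1}\Big] \\
&= \mathbb{E}_X\Big[\Big(\mathbf{1}(X \notin D_i) + \frac{\mathbf{1}(X \in D_i)}{P_i(X)}\Big)\mathbf{1}(h(X) \ne h'(X))\Big] \qquad (49) \\
&= \mathbb{E}_X\Big[\Big(\mathbf{1}(X \notin D_{m(i)}) + \frac{\mathbf{1}(X \in D_{m(i)})}{P_{m(i)}(X)}\Big)\mathbf{1}(h(X) \ne h'(X))\Big], \qquad (50)
\end{aligned}$$

where the last two equalities are from the fact that $X_i$ is independent from the past and replacing our overloaded
notation respectively. Lemma 6 with (47), (48), and (50) then implies for any $0 < \delta_m < 1/e^2$ and $m \ge 1$, the
following holds with probability at most $8\sqrt{\delta_m}\,(2 + \log_2(2\sqrt{\tau_{m-1}} + 1))\log\tau_m$:

$$\begin{aligned}
|\mathrm{reg}(h, h', \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h')|
\ge {}& \sqrt{\frac{4\log(1/\delta_m)}{\tau_m^2}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_X\Big[\Big(\mathbf{1}(X \notin D_i) + \frac{\mathbf{1}(X \in D_i)}{P_i(X)}\Big)\mathbf{1}(h(X) \ne h'(X))\Big]} \\
& + \frac{4\log(1/\delta_m)}{\tau_m P_{\min,m}}. \qquad (51)
\end{aligned}$$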


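Then we consider the concentration of the empirical error on the importance-weighted examples. Define the
random variables for the empirical errors:

$$E_i := \frac{Q_i\,\mathbf{1}(h(X_i) \ne Y_i \wedge X_i \in D_i)}{P_i(X_i)}$$

and the associated $\sigma$-fields $\mathcal{F}_i := \sigma(\{X_j, Y_j, Q_j\}_{j=1}^{i})$. By the same analysis of the sequence of instantaneous regrets,
we have that $E_i - \mathbb{E}[E_i \mid \mathcal{F}_{i-1}]$ is a martingale difference sequence adapted to the filtrations $\mathcal{F}_i$, $i \ge 1$, with the following
properties:

$$\mathbb{E}[E_i \mid \mathcal{F}_{i-1}] = \mathbb{E}[\mathbf{1}(X_i \in D_i \wedge h(X_i) \ne Y_i) \mid \mathcal{F}_{i-1}] = \mathrm{err}_{m(i)}(h),$$

$$|E_i - \mathbb{E}[E_i \mid \mathcal{F}_{i-1}]| \le \frac{1}{P_{\min,m(i)}} \le \frac{1}{P_{\min,m}} \le 2\sqrt{\tau_{m-1}} + 1,$$

for all $i$ such that $m(i) \le m$. Furthermore,

$$\mathbb{E}\big[(E_i - \mathbb{E}[E_i \mid \mathcal{F}_{i-1}])^2 \mid \mathcal{F}_{i-1}\big] \le \mathbb{E}\Big[\frac{\mathbf{1}(X_i \in D_i \wedge h(X_i) \ne Y_i)}{P_i(X_i)} \,\Big|\, \mathcal{F}_{i-1}\Big] = \mathbb{E}_{X,Y}\Big[\frac{\mathbf{1}(X \in D_{m(i)} \wedge h(X) \ne Y)}{P_{m(i)}(X)}\Big].$$

With these properties, Lemma 6 then implies for any $0 < \delta_m < 1/e^2$ and $m \ge 1$, the following holds with probability
at most $8\sqrt{\delta_m}\,(2 + \log_2(2\sqrt{\tau_{m-1}} + 1))\log\tau_m$:

$$|\mathrm{err}(h, Z_m) - \overline{\mathrm{err}}_m(h)| \ge \sqrt{\frac{4\log(1/\delta_m)}{\tau_m^2}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_{X,Y}\Big[\frac{\mathbf{1}(X \in D_i \wedge h(X) \ne Y)}{P_i(X)}\Big]} + \frac{4\log(1/\delta_m)}{\tau_m P_{\min,m}}. \qquad (52)$$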


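Setting

$$\delta_m = \left(\frac{\delta}{192\,|\mathcal{H}|^2\,\tau_m^2\,(\log\tau_m)^2}\right)^2$$

ensures that the probability of the union of the bad events (51) and (52) over all pairs of classifiers $h$, $h'$ and $m \ge 1$ is
bounded by $\delta$. Choosing $\delta \le |\mathcal{H}|/\sqrt{192}$, we have

$$\begin{aligned}
\log(1/\delta_m) &= 2\log\left(\frac{192\,|\mathcal{H}|^2\,\tau_m^2\,(\log\tau_m)^2}{\delta}\right) \\
&\le 2\big(2\log(|\mathcal{H}|/\delta) + 4\log\tau_m + \log 192\big) \\
&\le 8\big(\log(|\mathcal{H}|/\delta) + \log\tau_m\big),
\end{aligned}$$

leading to the desired statement. □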
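The final bound on $\log(1/\delta_m)$ in the proof of Lemma 2 can be spot-checked numerically. The sketch below assumes $\delta_m$ is the square of $\delta / (192\,|\mathcal{H}|^2 \tau_m^2 (\log\tau_m)^2)$, which is how we read the choice in the proof, and it assumes $\tau_m \ge 3$, $\delta \le 1$ and $|\mathcal{H}|/\delta \ge \sqrt{192}$; the function names are ours.

```python
import math


def log_inv_delta_m(tau, num_classifiers, delta):
    """log(1 / delta_m) for delta_m = (delta / (192 |H|^2 tau^2 (log tau)^2))^2."""
    inner = 192.0 * num_classifiers ** 2 * tau ** 2 * math.log(tau) ** 2 / delta
    return 2.0 * math.log(inner)


def claimed_upper_bound(tau, num_classifiers, delta):
    """8 * (log(|H| / delta) + log(tau)), the bound used to define eps_m."""
    return 8.0 * (math.log(num_classifiers / delta) + math.log(tau))


if __name__ == "__main__":
    for tau in (3, 10, 1000, 10 ** 6):
        for num_classifiers in (1, 50, 10 ** 4):
            for delta in (0.01, 0.05):
                lhs = log_inv_delta_m(tau, num_classifiers, delta)
                rhs = claimed_upper_bound(tau, num_classifiers, delta)
                print(tau, num_classifiers, delta, lhs <= rhs)
```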

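We then provide the proofs of Propositions 1 and 2.

Proof of Proposition 1. By the inequality (34) of Lemma 2, we have

$$|\mathrm{reg}(h, h^*, \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h^*)| \le \sqrt{\underbrace{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_X\Big[\Big(\frac{\mathbf{1}(X \in D_i)}{P_i(X)} + \mathbf{1}(X \notin D_i)\Big)\mathbf{1}(h(X) \ne h^*(X))\Big]}_{\mathrm{dev}_m(h)}} + \frac{\epsilon_m}{P_{\min,m}}. \qquad (53)$$

We now control the term $\mathrm{dev}_m(h)$ in order to establish the proposition. We have

$$\begin{aligned}
\frac{\tau_m}{\epsilon_m}\mathrm{dev}_m(h)
&= \sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_X\Big[\Big(\frac{\mathbf{1}(X \in D_i)}{P_i(X)} + \mathbf{1}(X \notin D_i)\Big)\mathbf{1}(h(X) \ne h^*(X))\Big] \\
&\le \sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_X\Big[\frac{\mathbf{1}(X \in D_i)}{P_i(X)}\big(\mathbf{1}(h(X) \ne h_i(X)) + \mathbf{1}(h^*(X) \ne h_i(X))\big) + \mathbf{1}(X \notin D_i)\,\mathbf{1}(h(X) \ne h^*(X))\Big] \\
&\le \sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_X\Big[2\alpha^2\,\mathbf{1}(X \in D_i)\big(\mathbf{1}(h(X) \ne h_i(X)) + \mathbf{1}(h^*(X) \ne h_i(X))\big) \\
&\qquad + 2\beta^2\gamma\tau_{i-1}\Delta_{i-1}\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big) + 2\xi\tau_{i-1}\Delta_{i-1}^2 + \mathbf{1}(h(X) \ne h^*(X) \wedge X \notin D_i)\Big],
\end{aligned}$$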

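where the second inequality uses our variance constraints in defining the distribution $P_i$ for classifiers $h$ and $h^*$. Note
that

$$\mathbf{1}(h(X) \ne h^*(X)) \le \mathbf{1}(h(X) \ne Y) + \mathbf{1}(h^*(X) \ne Y) = \big(\mathbf{1}(h(X) \ne Y) - \mathbf{1}(h^*(X) \ne Y)\big) + 2\cdot\mathbf{1}(h^*(X) \ne Y),$$

so that the final inequality can be rewritten as

$$\begin{aligned}
\frac{\tau_m}{\epsilon_m}\mathrm{dev}_m(h)
\le \sum_{i=1}^{m}(\tau_i - \tau_{i-1})\Big[& 2\alpha^2\big(\mathrm{reg}_i(h) + 2\,\mathrm{reg}_i(h_i)\big) + 12\alpha^2\,\mathrm{err}_i(h^*) + 2\beta^2\gamma\tau_{i-1}\Delta_{i-1}\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big) \\
& + 2\xi\tau_{i-1}\Delta_{i-1}^2 + \mathbb{E}_X\big[\mathbf{1}(h(X) \ne h^*(X) \wedge X \notin D_i)\big]\Big].
\end{aligned}$$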


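With the assumptions $\alpha \ge 1$ and $h^* \in A_i$ for all epochs $i \le m$, the first term $\mathrm{reg}_i(h)$ can be combined with the last
disagreement term and bounded by $2\alpha^2\,\overline{\mathrm{reg}}_i(h)$. Further noting that $\tau_{i-1}\Delta_{i-1} \le \tau_m\Delta_m$ by Lemma 7, we can further
simplify the inequality to

$$\begin{aligned}
\frac{\tau_m}{\epsilon_m}\mathrm{dev}_m(h)
&\le 2\alpha^2\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\overline{\mathrm{reg}}_i(h) + 4\alpha^2\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i) + 12\tau_m\alpha^2\,\overline{\mathrm{err}}_m(h^*) \\
&\quad + 2\beta^2\gamma\tau_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big) + 2\xi\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\tau_{i-1}\Delta_{i-1}^2.
\end{aligned}$$

The first summand is simply $2\alpha^2\tau_m\,\widetilde{\mathrm{reg}}_m(h)$ by definition. The final summand above can be bounded using Lemmas 7
and 8 since

$$\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\tau_{i-1}\Delta_{i-1}^2 = \sum_{i=1}^{m-1}(\tau_{i+1} - \tau_i)\,\tau_i\Delta_i^2 \le \tau_m\Delta_m\sum_{i=1}^{m-1}(\tau_{i+1} - \tau_i)\Delta_i \le 4\tau_m^2\Delta_m^2\log\tau_m.$$

Substituting the above inequalities back, we obtain

$$\begin{aligned}
\frac{\tau_m}{\epsilon_m}\mathrm{dev}_m(h)
&\le 2\alpha^2\tau_m\,\widetilde{\mathrm{reg}}_m(h) + 4\alpha^2\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i) + 12\tau_m\alpha^2\,\overline{\mathrm{err}}_m(h^*) \\
&\quad + 2\beta^2\gamma\tau_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big) + 8\xi\tau_m^2\Delta_m^2\log\tau_m.
\end{aligned}$$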

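Since $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, we can further bound

$$\begin{aligned}
\sqrt{\mathrm{dev}_m(h)}
&\le \sqrt{2\alpha^2\epsilon_m\,\widetilde{\mathrm{reg}}_m(h)} + 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)} + 2\alpha\sqrt{3\,\overline{\mathrm{err}}_m(h^*)\,\epsilon_m} \\
&\quad + \beta\sqrt{2\gamma\epsilon_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big)} + 2\Delta_m\sqrt{2\xi\tau_m\epsilon_m\log\tau_m}.
\end{aligned}$$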

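Substituting this inequality back into our deviation bound (53), we obtain

$$\begin{aligned}
|\mathrm{reg}(h, h^*, \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h^*)|
&\le \frac{\epsilon_m}{P_{\min,m}} + \sqrt{2\alpha^2\epsilon_m\,\widetilde{\mathrm{reg}}_m(h)} + 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)} + 2\alpha\sqrt{3\,\overline{\mathrm{err}}_m(h^*)\,\epsilon_m} \\
&\quad + \beta\sqrt{2\gamma\epsilon_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big)} + 2\Delta_m\sqrt{2\xi\tau_m\epsilon_m\log\tau_m}.
\end{aligned}$$

We can further use the Cauchy-Schwarz inequality to obtain the bound

$$\begin{aligned}
|\mathrm{reg}(h, h^*, \tilde Z_m) - \widetilde{\mathrm{reg}}_m(h, h^*)|
&\le \frac{1}{4}\widetilde{\mathrm{reg}}_m(h) + 2\alpha^2\epsilon_m + 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)} + 2\alpha\sqrt{3\,\overline{\mathrm{err}}_m(h^*)\,\epsilon_m} \\
&\quad + \beta\sqrt{2\gamma\epsilon_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big)} + 2\Delta_m\sqrt{2\xi\tau_m\epsilon_m\log\tau_m} + \frac{\epsilon_m}{P_{\min,m}} \\
&\le \frac{1}{4}\widetilde{\mathrm{reg}}_m(h) + 2\alpha^2\epsilon_m + 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)} + 2\alpha\sqrt{3\,\overline{\mathrm{err}}_m(h^*)\,\epsilon_m} \\
&\quad + \beta\sqrt{2\gamma\epsilon_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big)} + \Delta_m + \frac{\epsilon_m}{P_{\min,m}} \\
&\le \frac{1}{4}\widetilde{\mathrm{reg}}_m(h) + 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)} + 2\alpha\sqrt{3\,\overline{\mathrm{err}}_m(h^*)\,\epsilon_m} \\
&\quad + \beta\sqrt{2\gamma\epsilon_m\Delta_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\big(\mathrm{reg}(h, \tilde Z_{i-1}) + \mathrm{reg}(h^*, \tilde Z_{i-1})\big)} + 4\Delta_m,
\end{aligned}$$

where the last two inequalities use our assumptions on $\xi$ and $\alpha$ respectively.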


□ 


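Proof of Proposition 2. We start by observing that

$$|\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h_{m+1}, \tilde Z_m)| \le |\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h^*, \tilde Z_m)| + \mathrm{reg}(h^*, h_{m+1}, \tilde Z_m).$$

Since $h^* \in A_i$ for all epochs $i \le m$, we know that $h^*$ agrees with all the predicted labels. Consequently, $\mathrm{err}(h^*, \tilde Z_m) =
\mathrm{err}(h^*, Z_m)$, where we recall that $Z_m$ is the set of all examples where we queried labels up to epoch $m$. This allows
us to rewrite

$$|\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h^*, \tilde Z_m)| = |\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h^*, Z_m)|.$$

Under the event $\mathcal{E}$, the above deviation is bounded, according to Lemma 2, by

$$\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathbb{E}_{X,Y}\Big[\frac{\mathbf{1}(h^*(X) \ne Y, X \in D_i)}{P_i(X)}\Big]} + \frac{\epsilon_m}{P_{\min,m}} \le \sqrt{\frac{\epsilon_m\,\overline{\mathrm{err}}_m(h^*)}{P_{\min,m}}} + \frac{\epsilon_m}{P_{\min,m}},$$

where the inequality uses the bound $P_i(X) \ge P_{\min,i}$ for all $X \in D_i$ and $P_{\min,i} \ge P_{\min,m}$ for all epochs $i \le m$ by
Lemma 7. A further application of the Cauchy-Schwarz inequality yields the bound

$$|\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h^*, Z_m)| \le \frac{\overline{\mathrm{err}}_m(h^*)}{2} + \frac{3\epsilon_m}{2P_{\min,m}} \le \frac{\overline{\mathrm{err}}_m(h^*)}{2} + \frac{3\Delta_m}{2}.$$

Combining the bounds yields

$$|\overline{\mathrm{err}}_m(h^*) - \mathrm{err}(h_{m+1}, \tilde Z_m)| \le \frac{\overline{\mathrm{err}}_m(h^*)}{2} + \frac{3\Delta_m}{2} + \mathrm{reg}(h^*, h_{m+1}, \tilde Z_m),$$

which completes the proof of the proposition. □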

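Finally, we prove Lemmas 3 to 5 used in the proof of Theorem 1.

Proof of Lemma 3. We first bound the $\mathrm{reg}_i(h_i)$ terms. For $i = 1$, we have

$$\mathrm{reg}_1(h_1) = \mathrm{reg}(h_1) \le 1 \le \frac{\eta\Delta_0}{2}$$

by $P_{\min,1} = 1$ and our choices of $\eta$ and $\Delta_0$. For $2 \le i \le m$, we have

$$\mathrm{reg}_i(h_i) = \mathbb{E}_{X,Y}\big[\mathbf{1}(h_i(X) \ne Y, X \in D_i) - \mathbf{1}(h^*(X) \ne Y, X \in D_i)\big] = \mathrm{reg}(h_i) \le \widetilde{\mathrm{reg}}_{i-1}(h_i, h^*),$$

where the second equality uses the fact that $h^* \in A_i$ for all $i \le m$ by inductive hypothesis (9) and the inequality uses
Lemma 1. Consequently, we can bound $\widetilde{\mathrm{reg}}_{i-1}(h_i, h^*)$ using the event $\mathcal{E}_{i-1}$, since $\mathrm{reg}(h_i, h^*, \tilde Z_{i-1}) \le 0$. The event $\mathcal{E}_{i-1}$ now
further implies that

$$\mathrm{reg}_i(h_i) \le \widetilde{\mathrm{reg}}_{i-1}(h_i, h^*) \le 2\,\mathrm{reg}(h_i, h^*, \tilde Z_{i-1}) + \frac{\eta\Delta_{i-1}}{2} \le \frac{\eta\Delta_{i-1}}{2}. \qquad (54)$$

Using this, we can simplify $T_1$ as

$$\begin{aligned}
T_1 = 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\mathrm{reg}_i(h_i)}
&\le 2\alpha\sqrt{\frac{\epsilon_m}{\tau_m}\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\frac{\eta\Delta_{i-1}}{2}} \\
&\le 2\alpha\sqrt{2\eta\epsilon_m\Delta_m\log\tau_m} \\
&\le \frac{\eta\Delta_m}{12} + 24\alpha^2\epsilon_m\log\tau_m. \qquad (55)
\end{aligned}$$

Here the second inequality is by Lemma 8 and the third inequality is by Cauchy-Schwarz. □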


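Proof of Lemma 4. We first invoke Proposition 2, whose assumptions now hold due to the claim $h^* \in A_i$ in $\mathcal{E}_i$ for
all $i < m$, and obtain

$$\overline{\mathrm{err}}_m(h^*) \le 2\,\mathrm{err}(h_{m+1}, \tilde Z_m) + 3\Delta_m + 2\,\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m).$$

The above inequality allows us to simplify $T_2$ as

$$\begin{aligned}
T_2 = 2\alpha\sqrt{3\epsilon_m\,\overline{\mathrm{err}}_m(h^*)}
&\le 2\alpha\sqrt{3\epsilon_m\big(2\,\mathrm{err}(h_{m+1}, \tilde Z_m) + 3\Delta_m + 2\,\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m)\big)} \\
&\le 2\alpha\sqrt{6\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + 2\alpha\sqrt{9\epsilon_m\Delta_m} + 2\alpha\sqrt{6\epsilon_m\,\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m)} \\
&\le 2\alpha\sqrt{6\epsilon_m\,\mathrm{err}(h_{m+1}, \tilde Z_m)} + \Delta_m + \frac{1}{4}\mathrm{reg}(h^*, h_{m+1}, \tilde Z_m) + 33\alpha^2\epsilon_m, \qquad (56)
\end{aligned}$$

where the last inequality uses the Cauchy-Schwarz inequality. □

Proof of Lemma 5

Observe that the event $\mathcal{E}_{i-1}$ gives a direct bound of $\eta\Delta_{i-1}/4$ on the $\mathrm{reg}(h^*, h_i, \tilde Z_{i-1})$ terms. For the other term,
recall by the same event that for all $h \in \mathcal{H}$ and for all $i = 1, 2, \ldots, m-1$,

$$\mathrm{reg}(h, h^*, \tilde Z_i) \le \frac{3}{2}\widetilde{\mathrm{reg}}_i(h, h^*) + \frac{\eta}{4}\Delta_i.$$

Combining with the empirical regret bound for $h^*$, this implies that

$$\mathrm{reg}(h, \tilde Z_i) \le \frac{3}{2}\widetilde{\mathrm{reg}}_i(h, h^*) + \frac{\eta}{2}\Delta_i.$$

Consequently we have the bound

$$T_3^2 \le \beta^2\gamma\Delta_m\epsilon_m\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\Big(3\,\widetilde{\mathrm{reg}}_{i-1}(h, h^*) + \frac{3\eta}{2}\Delta_{i-1}\Big).$$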

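To simplify further, note that by the definition of $\widetilde{\mathrm{reg}}_i(h, h^*)$ and our earlier definition of $\overline{\mathrm{reg}}_i(h, h^*)$, we have

$$\begin{aligned}
\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\,\widetilde{\mathrm{reg}}_{i-1}(h, h^*)
&= \sum_{i=1}^{m-1}\frac{\tau_{i+1} - \tau_i}{\tau_i}\sum_{j=1}^{i}(\tau_j - \tau_{j-1})\,\overline{\mathrm{reg}}_j(h, h^*) \\
&= \sum_{j=1}^{m-1}(\tau_j - \tau_{j-1})\,\overline{\mathrm{reg}}_j(h, h^*)\sum_{i=j}^{m-1}\frac{\tau_{i+1} - \tau_i}{\tau_i} \\
&\le 4\log\tau_m\sum_{j=1}^{m-1}(\tau_j - \tau_{j-1})\,\overline{\mathrm{reg}}_j(h, h^*) \\
&\le 4\tau_m\log\tau_m\,\widetilde{\mathrm{reg}}_m(h, h^*),
\end{aligned}$$

where the first equality uses our convention $\widetilde{\mathrm{reg}}_0(h, h^*) = 0$ and proper index shifting, and the first inequality uses
Lemma 8. We also have

$$\sum_{i=1}^{m}(\tau_i - \tau_{i-1})\Delta_{i-1} \le 4\tau_m\Delta_m\log\tau_m$$

by Lemma 8. Consequently, we can rewrite

$$\begin{aligned}
T_3^2 &\le \beta^2\gamma\Delta_m\epsilon_m\big(12\tau_m\log\tau_m\,\widetilde{\mathrm{reg}}_m(h, h^*) + 6\tau_m\eta\log\tau_m\,\Delta_m\big) \\
&= \beta^2\gamma\tau_m\epsilon_m\log\tau_m\,\Delta_m\big(12\,\widetilde{\mathrm{reg}}_m(h, h^*) + 6\eta\Delta_m\big) \\
&\le \frac{\eta\Delta_m\,\widetilde{\mathrm{reg}}_m(h, h^*)}{72} + \frac{\eta^2\Delta_m^2}{144},
\end{aligned}$$

where the last inequality is by our choice of $\beta$ such that $\beta^2\gamma n\epsilon_M\log n \le \eta/864$. Taking square roots, we obtain

$$T_3 \le \sqrt{\frac{\eta\Delta_m\,\widetilde{\mathrm{reg}}_m(h, h^*)}{72} + \frac{\eta^2\Delta_m^2}{144}} \le \frac{1}{4}\widetilde{\mathrm{reg}}_m(h, h^*) + \frac{7\eta\Delta_m}{72}. \qquad (57)$$

□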
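The final steps of Lemmas 4 and 5 are applications of $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ and $2\sqrt{ab} \le a + b$. The sketch below is a numerical sanity check of the two resulting inequalities on random non-negative inputs, assuming the constants as we read them (33 and 1/4 in (56), 7/72 and 1/4 in (57)); the function names are ours and the check is no substitute for the algebra.

```python
import math
import random


def lemma4_slack(alpha, eps, delta_m, reg):
    """Right side minus left side of the last step in (56); should be >= 0."""
    lhs = 2 * alpha * math.sqrt(9 * eps * delta_m) + 2 * alpha * math.sqrt(6 * eps * reg)
    rhs = delta_m + reg / 4 + 33 * alpha ** 2 * eps
    return rhs - lhs


def lemma5_slack(eta, delta_m, reg):
    """Right side minus left side of (57); should be >= 0."""
    lhs = math.sqrt(eta * delta_m * reg / 72 + (eta * delta_m) ** 2 / 144)
    rhs = reg / 4 + 7 * eta * delta_m / 72
    return rhs - lhs


if __name__ == "__main__":
    rng = random.Random(0)
    worst4 = min(lemma4_slack(rng.uniform(1, 5), rng.random(), rng.random(), rng.random())
                 for _ in range(10000))
    worst5 = min(lemma5_slack(rng.uniform(1, 1000), rng.random(), rng.random())
                 for _ in range(10000))
    print(worst4 >= -1e-9, worst5 >= -1e-9)
```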


D Label Complexity 

Here we prove Theorem 2. We start with the following simple bound on the total number of label queries:

$$\sum_{i=1}^{n} Q_i \le \max\Big(3, \sum_{i=1}^{n}\mathbf{1}(X_i \in D_{m(i)})\Big) \qquad (58)$$

by the fact that Algorithm 1 queries only the labels of points in the disagreement region. The random variable $\mathbf{1}(X_i \in
D_{m(i)})$ is measurable with respect to the $\sigma$-field $\mathcal{F}_i := \sigma(\{X_j, Y_j, Q_j\}_{j=1}^{i})$, so

$$R_i := \mathbf{1}(X_i \in D_{m(i)}) - \mathbb{E}_i[\mathbf{1}(X_i \in D_{m(i)})]$$

forms a martingale difference sequence adapted to the filtrations $\mathcal{F}_i$, $i \ge 1$, where $\mathbb{E}_i[\cdot] := \mathbb{E}[\cdot \mid \mathcal{F}_{i-1}]$. Moreover, we
have $|R_i| \le 1$ and

$$\mathbb{E}_i[R_i^2] \le \mathbb{E}_i[\mathbf{1}(X_i \in D_{m(i)})].$$

Applying Lemma 3 of Kakade and Tewari [2009] with the above bounds and Cauchy-Schwarz, we get that with
probability at least $1 - \delta$,

$$\forall n \ge 3, \quad \sum_{i=1}^{n}\mathbf{1}(X_i \in D_{m(i)}) \le 2\sum_{i=1}^{n}\mathbb{E}_i[\mathbf{1}(X_i \in D_{m(i)})] + 4\log(4(\log n)/\delta). \qquad (59)$$

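We next bound the sum of the conditional expectations. Pick some $i$ and consider the case $X_i \in D_{m(i)}$. Let $m := m(i)$
for ease of notation. Define

$$\bar h := \begin{cases} h_m, & h_m(X_i) \ne h^*(X_i), \\ h', & h'(X_i) \ne h^*(X_i), \end{cases}$$

where

$$h_m := \arg\min_{h \in \mathcal{H}} \mathrm{err}(h, \tilde Z_{m-1}), \qquad (60)$$

$$h' := \arg\min_{h \in \mathcal{H} \,\wedge\, h(X_i) \ne h_m(X_i)} \mathrm{err}(h, \tilde Z_{m-1}). \qquad (61)$$

Because $X_i \in D_m := \mathrm{DIS}(A_m)$, we have $h' \in A_m$, implying $\bar h \in A_m$. Conditioned on the high probability event in
Theorem 5, we have $h^* \in A_m$ and hence

$$\begin{aligned}
\Pr_X(\bar h(X) \ne h^*(X)) &= \Pr_X(\bar h(X) \ne h^*(X) \wedge X \in D_m) \\
&\le \mathrm{reg}_m(\bar h) + 2\,\mathrm{err}_m(h^*) \\
&\le 16\gamma\Delta_{m-1}^* + 2\,\mathrm{err}_m(h^*),
\end{aligned}$$

where the last inequality is by Theorem 1. This implies that

$$X_i \in \mathrm{DIS}\big(\{h \mid \Pr_X(h(X) \ne h^*(X)) \le 16\gamma\Delta_{m-1}^* + 2\,\mathrm{err}_m(h^*)\}\big).$$

We thus have

$$\begin{aligned}
\mathbb{E}_i[\mathbf{1}(X_i \in \mathrm{DIS}(A_m))] &\le \mathbb{E}_i\big[\mathbf{1}\big(X_i \in \mathrm{DIS}(\{h \mid \Pr_X(h(X) \ne h^*(X)) \le 16\gamma\Delta_{m-1}^* + 2\,\mathrm{err}_m(h^*)\})\big)\big] \\
&\le \theta\big(16\gamma\Delta_{m-1}^* + 2\,\mathrm{err}_m(h^*)\big), \qquad (62)
\end{aligned}$$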

where the last inequality uses the definition of the disagreement coefficient

$$\theta := \sup_{r > 0}\frac{\Pr_X\big(\{X \mid \exists h \in \mathcal{H} \text{ s.t. } \Pr_X(h(X) \ne h^*(X)) \le r,\; h^*(X) \ne h(X)\}\big)}{r}.$$
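As a concrete illustration of this definition (not taken from the analysis), consider threshold classifiers on a uniform distribution over an interval, discretized into equal-mass cells. The sketch below computes the ratio inside the supremum by brute force; the grid construction, the choice of $h^*$ at the midpoint, and the function name are illustrative assumptions of ours.

```python
from fractions import Fraction


def disagreement_ratio(radius_cells, n_cells=200):
    """Pr(DIS(B(h*, r))) / r for threshold classifiers on a uniform grid.

    Cells c = 0..n_cells-1 each carry mass 1/n_cells. Classifier j labels cell c
    positive iff c >= j, h* is the threshold at n_cells // 2, and the radius is
    r = radius_cells / n_cells, so Pr(h_j != h*) = |j - t_star| / n_cells.
    """
    t_star = n_cells // 2

    def disagree(j, c):
        # Thresholds j and t_star label cell c differently iff c lies between them.
        return min(j, t_star) <= c < max(j, t_star)

    ball = [j for j in range(n_cells + 1) if abs(j - t_star) <= radius_cells]
    region = [c for c in range(n_cells) if any(disagree(j, c) for j in ball)]
    # (len(region) / n_cells) / (radius_cells / n_cells)
    return Fraction(len(region), radius_cells)


if __name__ == "__main__":
    ratios = [disagreement_ratio(k) for k in range(1, 201)]
    print(max(ratios))
```

In this toy class the ratio equals 2 for every radius up to one half and decreases afterwards, so the supremum is 2.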

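Summing (62) over $i \in \{1, \ldots, n\}$ and noting that the high probability event in Theorem 5 holds over all epochs, we
get that with probability at least $1 - \delta$,

$$\begin{aligned}
\forall n \ge 3, \quad \sum_{i=1}^{n}\mathbb{E}_i[\mathbf{1}(X_i \in D_{m(i)})]
&\le 3 + \sum_{j=2}^{M}(\tau_j - \tau_{j-1})\,\theta\big(16\gamma\Delta_{j-1}^* + 2\,\mathrm{err}_j(h^*)\big) \\
&\le 3 + 2n\theta\,\overline{\mathrm{err}}_M(h^*) + 16\gamma\theta\sum_{j=2}^{M}(\tau_j - \tau_{j-1})\Delta_{j-1}^* \\
&= 3 + 2n\theta\,\overline{\mathrm{err}}_M(h^*) + 16\gamma\theta\sum_{j=2}^{M}\frac{\tau_j - \tau_{j-1}}{\tau_{j-1}}\,\tau_{j-1}\Delta_{j-1}^*.
\end{aligned}$$

A similar argument as Lemma 7 shows that $\tau_j\Delta_j^*$ is increasing in $j$, so we have by a further invocation of Lemma 8

$$\begin{aligned}
\sum_{i=1}^{n}\mathbb{E}_i[\mathbf{1}(X_i \in D_{m(i)})]
&\le 3 + 2n\theta\,\overline{\mathrm{err}}_M(h^*) + 128\gamma\theta(n-1)\Delta_{M-1}^*\log(n-1) \\
&= 3 + 2n\theta\,\overline{\mathrm{err}}_M(h^*) + \theta\,O\left(\sqrt{n\,\overline{\mathrm{err}}_M(h^*)\Big(\log\Big(\frac{|\mathcal{H}|}{\delta}\Big)\log^2 n + \log^3 n\Big)} + \log\Big(\frac{|\mathcal{H}|}{\delta}\Big)\log^2 n + \log^3 n\right).
\end{aligned}$$

Combining this and (59) via a union bound leads to the desired result.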


E Proofs for Tsybakov’s low-noise condition 

We begin with a lemma that captures the behavior of the terms $\mathrm{err}_m(h^*)$, $\overline{\mathrm{err}}_m(h^*)$ and the probability of the disagreement
region under the Tsybakov noise condition (10). The proofs of Corollaries 2 and 4 are immediate given the lemma.

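Lemma 9. Under the conditions of Theorem 1, suppose further that the low-noise condition (10) holds. Then we have
for all epochs $m = 1, 2, \ldots, M$

$$\mathrm{err}_m(h^*) \le c\,\epsilon_m\log\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}}, \quad \text{and} \quad \overline{\mathrm{err}}_m(h^*) \le 5c\,\epsilon_m\log^2\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}}. \qquad (63)$$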


Proof. We will establish the lemma inductively. We make the following inductive hypothesis. There exists a constant
$c > 0$ (dependent on the distributional parameters) such that for all epochs $j \ge 1$, the bounds (63) in the statement
of the lemma hold. The base case for $j = 1$ trivially follows since $\mathrm{err}_1(h^*) = \overline{\mathrm{err}}_1(h^*) = \mathrm{err}(h^*) \le 1 \le
c\,\epsilon_1\log\tau_1\,\tau_1^{\frac{2(1-\omega)}{2-\omega}}$, which is clearly true for an appropriately large value of $c$. Suppose now that the claim is true for
epochs $j = 1, 2, \ldots, m-1$. We will establish the claim at epoch $m$. To see this, first note that we have

$$\mathrm{err}_m(h^*) = \Pr(h^*(X) \ne Y, X \in D_m) \le \Pr(X \in D_m).$$

Under the noise condition, we can further upper bound the probability of the disagreement region, since by Theorem 1
we obtain

$$\begin{aligned}
\Pr(X \in D_m) = \Pr(X \in \mathrm{DIS}(A_m)) &\le \Pr\big(X \in \mathrm{DIS}(\{h \in \mathcal{H} : \mathrm{reg}(h) \le 16\gamma\Delta_{m-1}^*\})\big) \\
&\le \Pr\big(X \in \mathrm{DIS}(\{h \in \mathcal{H} : \Pr(h(X) \ne h^*(X)) \le \zeta(16\gamma\Delta_{m-1}^*)^{\omega}\})\big),
\end{aligned}$$

where the first inequality follows from Theorem 1 and the second one is a consequence of Tsybakov's noise
condition (10). Recalling the definition of disagreement coefficient (11), this can be further upper bounded by

$$\Pr(X \in D_m) \le \theta\zeta(16\gamma\Delta_{m-1}^*)^{\omega}. \qquad (64)$$

Hence, we have obtained the bound

$$\mathrm{err}_m(h^*) \le \theta\zeta(16\gamma\Delta_{m-1}^*)^{\omega}.$$

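Note that $\Delta^*_{m-1} = c_1\sqrt{\epsilon_{m-1}\,\overline{\mathrm{err}}_{m-1}(h^*)} + c_2\,\epsilon_{m-1}\log\tau_{m-1}$. Our inductive hypothesis (63) allows us to upper bound the $\overline{\mathrm{err}}_{m-1}(h^*)$ term in this expression for $\Delta^*_{m-1}$, and hence we obtain

\[
\begin{aligned}
\Delta^*_{m-1}
&\le c_1\sqrt{\epsilon_{m-1} \cdot 5c\,\epsilon_{m-1}\log^2\tau_{m-1}\,\tau_{m-1}^{\frac{2(1-\omega)}{2-\omega}}} + c_2\,\epsilon_{m-1}\log\tau_{m-1} \\
&\le c_1\,\epsilon_{m-1}\log\tau_m\,\tau_m^{\frac{1-\omega}{2-\omega}}\sqrt{5c} + c_2\,\epsilon_{m-1}\log\tau_{m-1} \\
&\le \frac{\epsilon_m\tau_m}{\tau_{m-1}}\log\tau_m\Big(c_1\sqrt{5c}\,\tau_m^{\frac{1-\omega}{2-\omega}} + c_2\Big) \\
&\le 2\,\epsilon_m\log\tau_m\Big(c_1\sqrt{5c}\,\tau_m^{\frac{1-\omega}{2-\omega}} + c_2\Big).
\end{aligned}
\]

Since $\tau_m \ge 3$ and $0 < \omega \le 1$, we can further write

\[
\Delta^*_{m-1} \le 2\,\epsilon_m\log\tau_m\,\tau_m^{\frac{1-\omega}{2-\omega}}\big(c_1\sqrt{5c} + c_2\big). \tag{65}
\]

Substituting this inequality in our earlier bound on $\mathrm{err}_m(h^*)$ yields

\[
\mathrm{err}_m(h^*) \le \theta\zeta\Big[32\gamma\,\epsilon_m\log\tau_m\,\tau_m^{\frac{1-\omega}{2-\omega}}\big(c_1\sqrt{5c} + c_2\big)\Big]^{\omega}.
\]

Since $\epsilon_m\tau_m\log\tau_m \ge 1$ and $0 < \omega \le 1$, we can further bound

\[
\begin{aligned}
\mathrm{err}_m(h^*)
&\le \theta\zeta\,\epsilon_m\tau_m\log\tau_m\Big[32\gamma\,\tau_m^{\frac{1-\omega}{2-\omega}-1}\big(c_1\sqrt{5c} + c_2\big)\Big]^{\omega} \\
&= \theta\zeta\,\epsilon_m\tau_m\log\tau_m\Big[32\gamma\big(c_1\sqrt{5c} + c_2\big)\Big]^{\omega}\tau_m^{-\frac{\omega}{2-\omega}} \\
&= \theta\zeta\,\epsilon_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}}\log\tau_m\Big[32\gamma\big(c_1\sqrt{5c} + c_2\big)\Big]^{\omega} \\
&\le c\,\epsilon_m\log\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}}.
\end{aligned}
\]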

Here the last bound follows for any choice of $c$ such that

\[
c \ge \theta\zeta\Big[32\gamma\big(c_1\sqrt{5c} + c_2\big)\Big]^{\omega}.
\]

The above inequality has a solution since the LHS is smaller than the RHS at $c = 0$, while for $c$ large enough the LHS grows linearly in $c$ whereas the RHS grows as $c^{\omega/2}$, and hence is asymptotically smaller than the LHS.
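The existence argument above can be illustrated numerically. The following is a minimal sketch, assuming the condition reads $c \ge \theta\zeta[32\gamma(c_1\sqrt{5c} + c_2)]^{\omega}$; the function name `find_valid_c` and all numeric values are hypothetical and chosen only for illustration. Doubling $c$ doubles the LHS but multiplies the RHS by at most $2^{\omega/2} < 2$, so the search terminates.

```python
import math


def find_valid_c(theta, zeta, gamma, c1, c2, omega):
    """Return the first power of two c >= 1 with c >= rhs(c), together with rhs(c).

    rhs(c) = theta * zeta * (32 * gamma * (c1 * sqrt(5c) + c2)) ** omega
    (an assumed reading of the condition on c in the proof of Lemma 9).
    """
    if not 0.0 < omega <= 1.0:
        raise ValueError("omega must lie in (0, 1]")

    def rhs(c):
        return theta * zeta * (32.0 * gamma * (c1 * math.sqrt(5.0 * c) + c2)) ** omega

    c = 1.0
    # Each doubling multiplies the LHS by 2 and the RHS by at most 2 ** (omega / 2),
    # so the loop stops after finitely many iterations.
    while c < rhs(c):
        c *= 2.0
    return c, rhs(c)


# Hypothetical parameter values, for illustration only.
c, bound = find_valid_c(theta=2.0, zeta=1.0, gamma=216.0, c1=5.0, c2=10.0, omega=0.5)
print(c, bound)
```

Any larger value of $c$ also satisfies the condition, because the gap between the linear LHS and the sublinear RHS only widens.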

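We now verify the second part of our induction hypothesis for epoch $m$. Note that we have

\[
\begin{aligned}
\overline{\mathrm{err}}_m(h^*)
&= \frac{1}{\tau_m}\sum_{j=1}^{m} (\tau_j - \tau_{j-1})\,\mathrm{err}_j(h^*) \\
&\le \frac{1}{\tau_m}\sum_{j=1}^{m} (\tau_j - \tau_{j-1})\,c\,\epsilon_j\log\tau_j\,\tau_j^{\frac{2(1-\omega)}{2-\omega}} \\
&= \frac{1}{\tau_m}\sum_{j=1}^{m} \frac{\tau_j - \tau_{j-1}}{\tau_j}\,c\,\epsilon_j\tau_j\log\tau_j\,\tau_j^{\frac{2(1-\omega)}{2-\omega}}.
\end{aligned}
\]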

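We now observe that $\tau_j$ is clearly increasing in $j$, and so is $\tau_j\epsilon_j$ by definition. Consequently, we can further upper bound this inequality by

\[
\begin{aligned}
\overline{\mathrm{err}}_m(h^*)
&\le \frac{1}{\tau_m}\,\epsilon_m\tau_m\log\tau_m \sum_{j=1}^{m} \frac{\tau_j - \tau_{j-1}}{\tau_j}\,c\,\tau_j^{\frac{2(1-\omega)}{2-\omega}} \\
&\overset{(a)}{\le} c\,\epsilon_m\log\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}} \sum_{j=1}^{m} \frac{\tau_j - \tau_{j-1}}{\tau_j} \\
&= c\,\epsilon_m\log\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}} \Big(1 + \sum_{j=1}^{m-1} \frac{\tau_{j+1} - \tau_j}{\tau_{j+1}}\Big) \\
&\le c\,\epsilon_m\log\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}} \Big(1 + \sum_{j=1}^{m-1} \frac{\tau_{j+1} - \tau_j}{\tau_j}\Big),
\end{aligned}
\]

where the inequality (a) holds since $\tau_j$ is increasing in $j$ and $\omega \in (0, 1]$, so that the exponent on $\tau_j$ is non-negative, and the final inequality follows since $\tau_j \le \tau_{j+1}$. Invoking Lemma 8, we obtain

\[
\begin{aligned}
\overline{\mathrm{err}}_m(h^*)
&\le c\,\epsilon_m\log\tau_m\,(1 + 4\log\tau_m)\,\tau_m^{\frac{2(1-\omega)}{2-\omega}} \\
&\le 5c\,\epsilon_m\log^2\tau_m\,\tau_m^{\frac{2(1-\omega)}{2-\omega}},
\end{aligned}
\]

where we used the fact that $1 \le \log\tau_m$. Therefore, we have established the second part of the inductive claim, finishing the proof of the lemma. □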

Using the lemma, we now prove the corollaries.

Proof of Corollary 2. Based on the proof of Lemma 9, we see that $\Delta^*_m$ satisfies the bound (65). Plugging this into the statement of Theorem 1 immediately yields the corollary. □


Proof of Corollary 4. Based on the proof of Lemma 9, we see that the probability of the disagreement region satisfies the bound (64). Substituting the bound (65) yields the stated result. □


F Analysis of the Optimization Algorithm 

We begin by showing how to find the most violated constraint (Step 3) by calling an importance-weighted ERM oracle. Then we prove Theorem 3, followed by the framework and proof for Theorem 4.

F.1 Finding the Most Violated Constraint


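Recall our earlier notation $I_h^m(x) = \mathbb{1}(h(x) \neq h_m(x) \wedge x \in D_m)$. Consider solving (OP) using an unlabeled sample $S$ of size $u$. Note that Step 3 is equivalent to

\[
\begin{aligned}
&\arg\min_{h \in \mathcal{H}}\ b_m(h) - \mathbb{E}_X\Big[\frac{I_h^m(X)}{P_\lambda(X)}\Big] \\
&= \arg\min_{h \in \mathcal{H}}\ 2\gamma\beta^2(\tau_m - 1)\Delta_{m-1}\,\mathrm{err}(h, Z_{m-1}) + \mathbb{E}_X\Big[\Big(2\alpha^2 - \frac{1}{P_\lambda(X)}\Big) I_h^m(X)\Big] \\
&= \arg\min_{h \in \mathcal{H}}\ 2\gamma\beta^2(\tau_m - 1)\Delta_{m-1}\,\mathrm{err}(h, Z_{m-1}) + \mathbb{E}_X\Big[\Big(2\alpha^2 - \frac{1}{P_\lambda(X)}\Big) I_h^m(X) + \max\Big(\frac{1}{P_\lambda(X)} - 2\alpha^2,\ 0\Big)\mathbb{1}(X \in D_m)\Big] \\
&= \arg\min_{h \in \mathcal{H}}\ 2\gamma\beta^2(\tau_m - 1)\Delta_{m-1}\,\mathrm{err}(h, Z_{m-1}) + \mathbb{E}_X\Big[\max\Big(2\alpha^2 - \frac{1}{P_\lambda(X)},\ 0\Big)\mathbb{1}(X \in D_m)\,\mathbb{1}(h(X) \neq h_m(X))\Big] \\
&\qquad\qquad + \mathbb{E}_X\Big[\max\Big(\frac{1}{P_\lambda(X)} - 2\alpha^2,\ 0\Big)\mathbb{1}(X \in D_m)\,\mathbb{1}(h(X) \neq -h_m(X))\Big] \\
&= \arg\min_{h \in \mathcal{H}}\ 2\gamma\beta^2(\tau_m - 1)\Delta_{m-1}\,\mathrm{err}(h, Z_{m-1}) + \mathbb{E}_X\Big[|s_\lambda(X)|\,\mathbb{1}(X \in D_m)\,\mathbb{1}\big(h(X) \neq \mathrm{sign}(s_\lambda(X))\,h_m(X)\big)\Big],
\end{aligned}
\tag{66}
\]

where $s_\lambda(X) := 2\alpha^2 - 1/P_\lambda(X)$. In the above derivation, the second equality is by the fact that the extra term added to the objective is independent of $h$ and hence does not change the minimizer. The third equality uses a case analysis on the sign of $s_\lambda(X)$ and the identity $1 - \mathbb{1}(h(X) \neq h_m(X)) = \mathbb{1}(h(X) \neq -h_m(X))$. The last expression suggests that an importance-weighted error minimization oracle can find the desired classifier on examples $\{(X, Y^*, W)\}$ with labels and importance weights defined as:

\[
Y^* := \arg\min_{Y} c(X, Y), \qquad W := |c(X, 1) - c(X, -1)|,
\]

where

\[
c(X, Y) :=
\begin{cases}
2\gamma\beta^2\Delta_{m-1}\Big(\dfrac{\mathbb{1}(X_i \in D_{m(i)})\,\mathbb{1}(Y \neq Y_i)\,Q_i}{P_{m(i)}(X_i)} + \mathbb{1}(X_i \notin D_{m(i)})\,\mathbb{1}(Y \neq h_{m(i)}(X_i))\Big), & X = X_i \in Z_{m-1}, \\[2ex]
\dfrac{1}{u}\,|s_\lambda(X)|\,\mathbb{1}(X \in D_m)\,\mathbb{1}\big(Y \neq \mathrm{sign}(s_\lambda(X))\,h_m(X)\big), & X \in S.
\end{cases}
\tag{67}
\]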
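The label and weight construction above is an instance of the standard reduction from binary cost-sensitive classification to importance-weighted classification. The sketch below checks the underlying identity on a toy example: for any predictions, the total cost equals a constant plus the importance-weighted error, so both objectives share the same minimizer. The helper names and the cost values are hypothetical, and ties in the argmin are broken toward $+1$ as an arbitrary choice.

```python
import itertools
import math


def to_weighted_example(cost_pos, cost_neg):
    """Map the costs (c(x, +1), c(x, -1)) to a label Y* and an importance weight W."""
    label = 1 if cost_pos <= cost_neg else -1  # Y* = argmin_Y c(x, Y), ties toward +1
    weight = abs(cost_pos - cost_neg)          # W = |c(x, 1) - c(x, -1)|
    return label, weight


def total_cost(preds, costs):
    """Sum of c(x, h(x)) over the examples."""
    return sum(cp if p == 1 else cn for p, (cp, cn) in zip(preds, costs))


def weighted_error(preds, examples):
    """Importance-weighted 0/1 error against the constructed labels."""
    return sum(w for p, (y, w) in zip(preds, examples) if p != y)


# Toy costs (c(x, +1), c(x, -1)); the values are arbitrary.
costs = [(0.5, 2.0), (3.0, 1.0), (0.0, 0.0), (1.5, 1.5), (0.25, 4.0)]
examples = [to_weighted_example(cp, cn) for cp, cn in costs]
offset = sum(min(cp, cn) for cp, cn in costs)  # does not depend on the predictions

# Check the identity for every possible prediction vector.
for preds in itertools.product((-1, 1), repeat=len(costs)):
    assert math.isclose(total_cost(preds, costs),
                        offset + weighted_error(preds, examples))
```

Because the offset does not depend on the classifier, an importance-weighted ERM oracle run on the constructed examples returns a minimizer of the original cost.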


F.2 Proof of Theorem 3

Where clear from context, we drop the subscript m. 

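We first show that each coordinate ascent step causes sufficient increase in the dual objective. Pick any $h$ and $\lambda$. Let $\lambda'$ be identical to $\lambda$ except that $\lambda'_h = \lambda_h + \delta$ for some $\delta > 0$. Then the increase in the dual objective $\mathcal{D}$ can be computed directly:

\[
\begin{aligned}
\mathcal{D}(\lambda') - \mathcal{D}(\lambda)
&= \delta\,\mathbb{E}_X[I_h^m(X)] + 2\,\mathbb{E}_X\Big[\mathbb{1}(X \in D_m)\Big(\sqrt{q_\lambda(X)^2 + \delta I_h^m(X)} - q_\lambda(X)\Big)\Big] - \delta b(h) \\
&\ge \delta\,\mathbb{E}_X[I_h^m(X)] + 2\,\mathbb{E}_X\Big[q_\lambda(X)\Big(\frac{\delta I_h^m(X)}{2 q_\lambda(X)^2} - \frac{\delta^2 I_h^m(X)^2}{8 q_\lambda(X)^4}\Big)\Big] - \delta b(h) \qquad (68) \\
&= \delta\,\mathbb{E}_X\Big[\frac{1 + q_\lambda(X)}{q_\lambda(X)}\,I_h^m(X)\Big] - \frac{\delta^2}{4}\,\mathbb{E}_X\Big[\frac{I_h^m(X)^2}{q_\lambda(X)^3}\Big] - \delta b(h) \\
&= \delta\Big(\mathbb{E}_X\Big[\frac{I_h^m(X)}{P_\lambda(X)}\Big] - b(h)\Big) - \frac{\delta^2}{4}\,\mathbb{E}_X\Big[\frac{I_h^m(X)^2}{q_\lambda(X)^3}\Big]. \qquad (69)
\end{aligned}
\]

The inequality (68) uses the fact that $\sqrt{1 + z} \ge 1 + z/2 - z^2/8$ for all $z \ge 0$ (provable, for instance, using Taylor's theorem). The lower bound (69) on the increase in the objective value is maximized exactly at

\[
\delta = 2 \cdot \frac{\mathbb{E}_X[I_h^m(X)/P_\lambda(X) - b(h)]}{\mathbb{E}_X[I_h^m(X)^2/q_\lambda(X)^3]} \tag{70}
\]

as in Step 7. Plugging into (69), it follows that if $h$ is chosen on some iteration of Algorithm 2 prior to halting, then the dual objective $\mathcal{D}$ increases by at least

\[
\frac{\mathbb{E}_X[I_h^m(X)/P_\lambda(X) - b(h)]^2}{\mathbb{E}_X[I_h^m(X)^2/q_\lambda(X)^3]} \ge \epsilon^2\mu^3 \tag{71}
\]

since $q_\lambda(x) \ge \mu$, and since $\mathbb{E}_X[I_h^m(X)/P_\lambda(X) - b(h)] \ge \epsilon$.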

The initial dual objective is $\mathcal{D}(0) = (1 + \mu)^2 \Pr(D_m)$. Further, by duality and the fact that $P(X) = 1/2$ is a feasible solution to the primal problem, we have $\mathcal{D}(\lambda) \le 2(1 + \mu^2)\Pr(D_m)$. And of course, rescaling can never cause the dual objective to decrease. Combining, it follows that the coordinate ascent algorithm halts in at most $\Pr(D_m)\big(2(1 + \mu^2) - (1 + \mu)^2\big)/(\epsilon^2\mu^3) \le \Pr(D_m)/(\epsilon^2\mu^3)$ rounds, proving the bound given in the theorem.
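The per-step progress guarantee can be checked numerically on a toy sample. The sketch below is not the paper's implementation: it assumes the dual has the form $\mathcal{D}(\lambda) = \mathbb{E}_X[\mathbb{1}(X \in D_m)(1 + q_\lambda(X))^2] - \sum_h \lambda_h b(h)$ with $q_\lambda(x) = \sqrt{\mu^2 + \sum_h \lambda_h I_h^m(x)}$ and $P_\lambda = q_\lambda/(1 + q_\lambda)$ on $D_m$, a form inferred from $\mathcal{D}(0) = (1+\mu)^2\Pr(D_m)$ and the derivation above. All function names and data are hypothetical.

```python
import math


def q(lam, indicators_x, mu):
    """q_lambda(x) = sqrt(mu^2 + sum_h lambda_h * I_h(x))."""
    return math.sqrt(mu ** 2 + sum(l * i for l, i in zip(lam, indicators_x)))


def dual(lam, indicators, in_region, b, mu):
    """Assumed dual objective: E[1(x in D)(1 + q)^2] - sum_h lambda_h b(h)."""
    u = len(indicators)
    avg = sum(d * (1.0 + q(lam, row, mu)) ** 2
              for row, d in zip(indicators, in_region)) / u
    return avg - sum(l * bh for l, bh in zip(lam, b))


def ascent_step(lam, h, indicators, b, mu):
    """One coordinate step on lambda_h; returns (new lambda, guaranteed increase)."""
    u = len(indicators)
    qs = [q(lam, row, mu) for row in indicators]
    # Constraint violation E[I_h / P_lambda] - b(h), using 1 / P_lambda = (1 + q) / q.
    violation = sum(row[h] * (1.0 + qx) / qx for row, qx in zip(indicators, qs)) / u - b[h]
    # E[I_h^2 / q^3]; I_h is an indicator, so I_h^2 = I_h.
    curvature = sum(row[h] / qx ** 3 for row, qx in zip(indicators, qs)) / u
    delta = 2.0 * violation / curvature
    new_lam = list(lam)
    new_lam[h] += delta
    return new_lam, violation ** 2 / curvature


# Toy sample: rows hold I_h(x) for two hypotheses; I_h(x) = 0 outside the region D.
indicators = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 0]]
in_region = [1, 1, 1, 1, 1, 0]
b = [0.5, 0.5]
mu = 0.1
lam = [0.0, 0.0]

new_lam, guaranteed = ascent_step(lam, 0, indicators, b, mu)
gain = dual(new_lam, indicators, in_region, b, mu) - dual(lam, indicators, in_region, b, mu)
assert gain >= guaranteed
```

The actual gain exceeds the guaranteed one because the quadratic lower bound on $\sqrt{1 + z}$ is not tight away from $z = 0$.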

By this same reasoning, the left-hand side of (71) is equal to $\delta \cdot \mathbb{E}_X[I_h^m(X)/P_\lambda(X) - b(h)]$, which is at least $\delta\epsilon$. That is, the change on each round in the dual objective $\mathcal{D}$ is at least $\epsilon$ times the change in one of the coordinates $\lambda_h$. Furthermore, the rescaling step can never cause the weights $\lambda_h$ to increase. Therefore, $\epsilon\|\lambda\|_1$ is upper bounded by the total change in the dual objective, which we bounded above. This proves the bound on $\|\lambda\|_1$ given in the theorem.

To see (16), consider first the function $g(s) = \mathcal{D}(s \cdot \lambda)$ for $\lambda$ as in the algorithm after the rescaling step has been executed. At this point, it is necessarily the case that $s = 1$ maximizes $g$ over $s \in [0, 1]$ (since $\lambda$ has already been rescaled). This implies that $g'(1) \ge 0$, where $g'$ is the derivative of $g$; that is,

\[
0 \le g'(1) = \mathbb{E}_X\Big[\frac{\sum_{h} \lambda_h I_h^m(X)}{P_\lambda(X)}\Big] - \sum_{h} \lambda_h b(h). \tag{72}
\]

Now let $F(P)$ denote the modified primal objective function in (12) and let $F^*$ denote the optimal objective value

- /x 2 E A - 


F* := inf E a 
p 


1 


[1-P(X)\ 


t(x g D m ) 


P(X) 


s.t. 


P satisfying < 0 > and \/x £ X 0 < P(x) < 1. 


Then we have 


F(P k ) < 


< 


< 


F(P k ) + J2^(Kx 

h ' 

inf C(P, A) 

0<P(x)<l 

sup inf C(P, A) 

A > 00 < P ( a ;)< 1 


' Z?(X) 

Px(X)\ 


-b(h) 


(73) 


(74) 


(75) 


(76) 


Here, (74) follows from (72); (75) by the definition of $P_\lambda(X)$ as the minimizer of the Lagrangian. To establish (76), first notice that the following holds for all feasible $P$ and all non-negative $\lambda$:

\[
\inf_{0 \le P'(x) \le 1} \mathcal{L}(P', \lambda) \le \mathcal{L}(P, \lambda) \le F(P)
\]

by the definition of the Lagrangian. This implies

\[
\inf_{0 \le P(x) \le 1} \mathcal{L}(P, \lambda) \le F^*
\]

for all non-negative $\lambda$, leading to (76). Then we have

1 


E 


1 -^POJ 


<F(P k )<F* < f*+gPiiD m ). 
F.3 Proof of Theorem 4

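For $\epsilon > 0$, define $\Lambda_\epsilon := \{\lambda \in \mathbb{R}^{\mathcal{H}} : \lambda \ge 0,\ \|\lambda\|_1 \le 1/\epsilon\}$. We begin with a simple lemma.

Lemma 10. Suppose $\phi : \mathbb{R} \times \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz with respect to its first argument, and $\phi\big(\sum_{h \in \mathcal{H}} \lambda_h I_h^m(x), x\big) \le R$ for all $\lambda \in \Lambda_\epsilon$ and $x \in \mathcal{X}$. Let $\hat{\mathbb{E}}_X[\cdot]$ denote the empirical expectation with respect to an i.i.d. sample of size $u$ from $P_X$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, every $\lambda \in \Lambda_\epsilon$ satisfies

\[
\Big|\hat{\mathbb{E}}_X\Big[\phi\Big(\sum_{h \in \mathcal{H}} \lambda_h I_h^m(X), X\Big)\Big] - \mathbb{E}_X\Big[\phi\Big(\sum_{h \in \mathcal{H}} \lambda_h I_h^m(X), X\Big)\Big]\Big| \le \frac{2L}{\epsilon}\sqrt{\frac{2\ln|\mathcal{H}|}{u}} + R\sqrt{\frac{\ln(1/\delta)}{u}}.
\]

Proof. Let $\mathbf{x} \in \{0, 1\}^{\mathcal{H}}$ denote the vector with $\mathbf{x}_h = \mathbb{1}(h(x) \neq h_m(x))$, and define the linear function class

\[
\mathcal{F} := \{x \mapsto \langle \lambda, \mathbf{x} \rangle : \lambda \in \Lambda_\epsilon\}.
\]

By a simple variant of the argument by Bartlett and Mendelson [2002], with probability at least $1 - \delta$,

\[
\Big|\hat{\mathbb{E}}_X\Big[\phi\Big(\sum_{h \in \mathcal{H}} \lambda_h I_h^m(X), X\Big)\Big] - \mathbb{E}_X\Big[\phi\Big(\sum_{h \in \mathcal{H}} \lambda_h I_h^m(X), X\Big)\Big]\Big| \le 2L \cdot \mathcal{R}_u(\mathcal{F}) + R\sqrt{\frac{\ln(1/\delta)}{u}}
\]

for all $\lambda \in \Lambda_\epsilon$, where $\mathcal{R}_u(\mathcal{F})$ is the expected Rademacher average for the linear function class $\mathcal{F}$ for an i.i.d. sample of size $u$. By Kakade et al. [2009], this Rademacher complexity satisfies

\[
\mathcal{R}_u(\mathcal{F}) \le \frac{1}{\epsilon}\sqrt{\frac{2\ln|\mathcal{H}|}{u}}.
\]

This completes the proof. □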
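As a sanity check on the Rademacher complexity bound for an $\ell_1$-bounded, non-negative linear class over binary features, the following sketch estimates the empirical Rademacher average by Monte Carlo on synthetic data and compares it with $\frac{1}{\epsilon}\sqrt{2\ln|\mathcal{H}|/u}$. It assumes the Rademacher average is defined without absolute values, in which case the supremum over $\Lambda_\epsilon$ of $\langle\lambda, v\rangle$ equals $\max(0, \max_h v_h)/\epsilon$. The data, sizes and function name are hypothetical.

```python
import math
import random


def rademacher_estimate(features, eps, trials, rng):
    """Monte Carlo estimate of E_sigma sup_{lam >= 0, ||lam||_1 <= 1/eps} <lam, v>,
    where v_h = (1/u) * sum_i sigma_i * features[i][h]."""
    u, n_h = len(features), len(features[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(u)]
        corr = [sum(s * row[h] for s, row in zip(sigma, features)) / u
                for h in range(n_h)]
        # A linear function over the scaled simplex (plus the origin) is maximized
        # at a vertex, giving max(0, max_h v_h) / eps.
        total += max(0.0, max(corr)) / eps
    return total / trials


rng = random.Random(0)
u, n_h, eps = 200, 50, 0.5
# Synthetic binary disagreement indicators; purely illustrative.
features = [[rng.randint(0, 1) for _ in range(n_h)] for _ in range(u)]

estimate = rademacher_estimate(features, eps, trials=200, rng=rng)
bound = math.sqrt(2.0 * math.log(n_h) / u) / eps
print(estimate, bound)
```

On data like this the estimate sits well below the bound, since the finite-class argument is loose when only about half of the features are active.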

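Lemma 11. Pick any $\delta \in (0, 1)$. Let $\hat{\mathbb{E}}_X[\cdot]$ denote the empirical expectation with respect to an i.i.d. sample of size $u$ from $P_X$. With probability at least $1 - \delta$, every $\lambda \in \Lambda_\epsilon$ satisfies

\[
\Big|\hat{\mathbb{E}}_X\Big[\frac{1}{1 - P_\lambda(X)}\Big] - \mathbb{E}_X\Big[\frac{1}{1 - P_\lambda(X)}\Big]\Big| \le \sqrt{\frac{2\ln|\mathcal{H}|}{\mu^2\epsilon^2 u}} + \sqrt{\frac{(\mu^2 + 1/\epsilon)\ln(3/\delta)}{u}},
\]

and for all $h \in \mathcal{H}$,

\[
\Big|\hat{\mathbb{E}}_X\Big[\frac{I_h^m(X)}{P_\lambda(X)}\Big] - \mathbb{E}_X\Big[\frac{I_h^m(X)}{P_\lambda(X)}\Big]\Big| \le \sqrt{\frac{2\ln|\mathcal{H}|}{\mu^4\epsilon^2 u}} + \sqrt{\frac{\ln(3|\mathcal{H}|/\delta)}{\mu^2 u}} + \sqrt{\frac{\ln(6|\mathcal{H}|/\delta)}{2u}},
\]

and

\[
\Big|\hat{\mathbb{E}}_X[I_h^m(X)] - \mathbb{E}_X[I_h^m(X)]\Big| \le \sqrt{\frac{\ln(6|\mathcal{H}|/\delta)}{2u}}.
\]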


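Proof. Observe that $1/(1 - P_\lambda(x)) = 1 + q_\lambda(x)$ for all $\lambda \in \Lambda_\epsilon$ and $x \in \mathcal{X}$. Now we apply Lemma 10 to the function $\phi_1(z, x) := \sqrt{\mu^2 + z}$, which is $(2\mu)^{-1}$-Lipschitz with respect to its first argument. Since $q_\lambda(x) = \phi_1\big(\sum_{h \in \mathcal{H}} \lambda_h I_h^m(x), x\big) \le \sqrt{\mu^2 + 1/\epsilon}$ for all $\lambda \in \Lambda_\epsilon$ and $x \in \mathcal{X}$, Lemma 10 implies that, with probability at least $1 - \delta/3$,

\[
\Big|\hat{\mathbb{E}}_X\Big[\frac{1}{1 - P_\lambda(X)}\Big] - \mathbb{E}_X\Big[\frac{1}{1 - P_\lambda(X)}\Big]\Big| \le \frac{1}{\mu\epsilon}\sqrt{\frac{2\ln|\mathcal{H}|}{u}} + \sqrt{\frac{(\mu^2 + 1/\epsilon)\ln(3/\delta)}{u}}, \quad \forall \lambda \in \Lambda_\epsilon. \tag{77}
\]

Next, observe that for every $h \in \mathcal{H}$ and $x \in \mathcal{X}$,

\[
\frac{I_h^m(x)}{P_\lambda(x)} = I_h^m(x) + \frac{I_h^m(x)}{q_\lambda(x)}.
\]

By Hoeffding's inequality and a union bound, we have with probability at least $1 - \delta/3$,

\[
\Big|\hat{\mathbb{E}}_X[I_h^m(X)] - \mathbb{E}_X[I_h^m(X)]\Big| \le \sqrt{\frac{\ln(6|\mathcal{H}|/\delta)}{2u}}, \quad \forall h \in \mathcal{H}. \tag{78}
\]

Now we apply Lemma 10 to the functions $\phi_h(z, x) := I_h^m(x)/\sqrt{\mu^2 + z}$ for each $h \in \mathcal{H}$; each function $\phi_h$ is $(2\mu^2)^{-1}$-Lipschitz with respect to its first argument. Furthermore, since $\phi_h\big(\sum_{h' \in \mathcal{H}} \lambda_{h'} I_{h'}^m(x), x\big) = I_h^m(x)/q_\lambda(x) \le 1/\mu$ for all $\lambda \in \Lambda_\epsilon$ and $x \in \mathcal{X}$, Lemma 10 and a union bound over all $h \in \mathcal{H}$ imply that, with probability at least $1 - \delta/3$,

\[
\Big|\hat{\mathbb{E}}_X\Big[\frac{I_h^m(X)}{q_\lambda(X)}\Big] - \mathbb{E}_X\Big[\frac{I_h^m(X)}{q_\lambda(X)}\Big]\Big| \le \sqrt{\frac{2\ln|\mathcal{H}|}{\mu^4\epsilon^2 u}} + \sqrt{\frac{\ln(3|\mathcal{H}|/\delta)}{\mu^2 u}}, \quad \forall \lambda \in \Lambda_\epsilon,\ h \in \mathcal{H}. \tag{79}
\]

Finally, by a union bound, all of (77), (78), and (79) hold simultaneously with probability at least $1 - \delta$. □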

We can now prove Theorem 4. We first state a slightly more explicit version of the theorem, which is then proved.

Theorem 6. Let $S$ be an i.i.d. sample of size $u$ from $P_X$. Suppose Algorithm 2 is run on the $m$-th epoch for solving (OP$_{S,\epsilon}$) up to slack $\epsilon$ in the variance constraints. Then the following holds:

1. AlgorithnnAhalts in at most Jp I J' Drn \ iterations, where Pr (D m ) := s G Dm)/u. 

2. The solution $\hat{\lambda} \ge 0$ it outputs has bounded $\ell_1$ norm:

$\|\hat{\lambda}\|_1 \le \widehat{\Pr}(D_m)/\epsilon$.


3. There exists an absolute constant C > 0 such that the following holds. If 


u > C- 



log |^| 



iog(V*A 
e 2 J ’ 


then with probability at least 1 — 6, the query probability function P« ( x) satisfies 

• All constraints of (op) except with slack 2.5e in constraints 

• Approximate primal optimality: 


E a - 


1 


< /* + 8P m in im Pr(D m ) + (2 + 4P m i nim )e, 


where f* is the optimal value of (op) defined in G3- 

Theorem 4 is just a result of some simplifications in the $O(\cdot)$ notation in the above result. We now prove the theorem.

Proof of Theorem 6. The first two statements, finite convergence and boundedness of the solution's $\ell_1$ norm, can be proved with the techniques in Appendix F.2 that establish the same for Theorem 3. We thus focus on proving the third statement here.

Let $\hat{\mathbb{E}}_X[\cdot]$ denote empirical expectation with respect to $S$. Hoeffding's inequality implies that with probability at least $1 - \delta/2$,

\[
\hat{\mathbb{E}}_X[\mathbb{1}(X \in D_m)] \le \mathbb{E}_X[\mathbb{1}(X \in D_m)] + \epsilon. \tag{80}
\]
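Also, Lemma 11 implies that with probability at least $1 - \delta/2$,

\[
\Big|\hat{\mathbb{E}}_X\Big[\frac{1}{1 - P_\lambda(X)}\Big] - \mathbb{E}_X\Big[\frac{1}{1 - P_\lambda(X)}\Big]\Big| \le \epsilon, \quad \forall \lambda \in \Lambda_{\epsilon/2}; \tag{81}
\]

\[
\Big|\hat{\mathbb{E}}_X[I_h^m(X)] - \mathbb{E}_X[I_h^m(X)]\Big| \le \epsilon/(8\alpha^2), \quad \forall h \in \mathcal{H}; \tag{82}
\]

\[
\Big|\hat{\mathbb{E}}_X\Big[\frac{I_h^m(X)}{P_\lambda(X)}\Big] - \mathbb{E}_X\Big[\frac{I_h^m(X)}{P_\lambda(X)}\Big]\Big| \le \epsilon/4, \quad \forall \lambda \in \Lambda_{\epsilon/2},\ h \in \mathcal{H}. \tag{83}
\]

Therefore, by a union bound, there is an event of probability mass at least $1 - \delta$ on which Eqs. (80), (81), (82), (83) hold simultaneously. We henceforth condition on this event.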

By Theorem^ A satisfies || A||i < 1/e, the bound constraints in <[6), as well as 


E, 


Zh(X) 

L^POJ 


<b m (h) + 2e, Vh£'H , 


and 


E 


x 


Applying ([82) and 


1-4WJ 


to 


< E x 


l-P e *(X) 


+ 4P m in iTrl Ex'[l(Ar £ D m )] 


(84) 


(85) 


where P* is the optimal solutiorjj to (OPs iE ). We use this to show that J’ x is a feasible solution for (OP2.5 e ), and 
compare its objective value to the optimal objective value for (op). 

i gives 


Ex 


1™(X) 


P'dX) 


< b m (h) + 2.5e, Vh£H. 


Since Pc also satisfies the bound constraints in <(6), it follows that P x is feasible for ((J l ) 2.r, E ) - 
Now we turn to the objective value. Applying ((80) and <(81) to <(85) gives 


Ex 


1 


i -PdX) 


< Ex 


l-P*(X) 


+ 4P m i njm Ex[l(A' £ D m )\ + (1 + 4P m i n!m )e. 


( 86 ) 


We need to relate the first term on the right-hand side to the optimal objective value for (op). 

Let A* be the output of running Algorithm [2] for solving (op) up to slack e/2. By Theorem [3] A* satisfies 
II A* ||i < 2/e, the bound constraints in <[6), as well as 


Ex 


and 


Ex 


Zh(X) 

PxdX)\ 
1 


Li-Pa-PqJ 


< b m (h) + £/ 2, V/i G P, 

< f* +4P min , ro Ex[l(IsB m )]. 


Applying < (81) to < (87) , we have 

Ex 


1-Pa*W 
And applying < (82) and < (83) to < (87) gives 

Ph\X) 


< f* + 4P min , m Ex[l(X G D m )\ + e. 


Ex 


PxdX) 


< b m (h) + e, \/h £ H. 


(87) 


( 88 ) 


(89) 


8 Note that on a finite sample S, the primal optimization variables P(x) are in a compact and convex subset of RP , and therefore an optimal 
solution can always be attained. 
Table 2: Binary classification datasets used in experiments

Dataset               n         s        d      r
titanic            2201         3        8  0.323
abalone            4176         8        8  0.498
mushroom           8124        22      117  0.482
eeg-eye-state     14980   13.9901       14  0.449
20news            18845   93.8854   101631  0.479
magic04           19020   9.98728       10  0.352
letter            20000   15.5807       16  0.233
ijcnn1            24995        13       22  0.099
nomao             34465   82.3306      174  0.286
shuttle           43500   7.04984        9  0.216
bank              45210   13.9519       44  0.117
a9a               48841   13.8676      123  0.239
adult             48842   11.9967      105  0.239
w8a               49749   11.6502      300  0.030
bio              145750   73.4184       74  0.009
maptaskcoref     158546   40.4558     5944  0.438
activity         165632   18.5489       20  0.306
skin             245057     2.948        3  0.208
vehv2binary      299254   48.5652      105  0.438
census           299284   32.0072      401  0.062
covtype          581011   11.8789       54  0.488
rcv1             781265   75.7171    43001  0.474


This establishes that $\lambda^*$ is a feasible solution for (OP$_{S,\epsilon}$). In particular,


E x 


1 


1-P*(X) 


< Ex 


l-Px*(X)_ 

< f* + 4P min , m Ex[l(X G D m )\ + £ 


where the second inequality follows from (88). We now combine this with (86) to obtain


Ex 


1 

.1 ~Px(X). 


< f* + 8P m i nim Ex[l(Ai G D m )] + (2 + 4P m i nim )e. 


□ 


G Experimental Details 

Here we provide more details about the experiments. 

G.1 Datasets

Table 2 gives details about the 22 binary classification datasets used in our experiments, where n is the number of examples, d is the number of features, s is the average number of non-zero features per example, and r is the proportion of the minority class.

G.2 Hyper-parameter Settings


We start with the actual hyper-parameters used by OAC. Going back to Algorithm 1, we note that the tuning parameters are used mostly through the following three quantities: $\gamma\Delta_{i-1}$, $\alpha$ and $\beta$. We use this fact to reduce the number of input parameters. Let $c_0 := \gamma^2 c_1\,32(\log(|\mathcal{H}|/\delta) + \log(i - 1))$ (treating $\log(i - 1)$ as a constant) and set $\eta = 864$, $\gamma = \eta/4$ and $c_2 = \eta c_1^2/4$ according to our theory. Then we have


7 A 1-1 = V 7 2 Cie i _ierr(/i i , 1 ) + 7 c 2 ei_i log(f - 1) 


'c 0 err(/ii,Z,;_i) c 2 log(i - 1) 
+ c 0 - 


i - 1 


) -1 1 

7C1 2—1 


where — = c\ = O(a). Based on this, we use 


Aj_i 


7 — 1 

in Algorithm [3] in place of yA,.!. Next we consider 

1 


Ic 0 err{hi, Zi-x) log(* - 1) 

+ max(2a, 4)cq —— 


i - 1 


(90) 


r < 


= O 


216ne n logn 
7 2 G 

216co logn 
a 

c o, 


V ne n « co/(7 2 ci) by treating logn as a constant 
by again treating log n as a constant and C\ = O(a). 


Based on the last expression, we set $\beta := \sqrt{\alpha/c_0}/\beta_{\mathrm{scale}}$, where $\beta_{\mathrm{scale}} > 0$ is a tuning parameter that controls the influence of the regret term in the variance constraints. In sum, the actual input parameters boil down to the cover size $l$, $\alpha \ge 1$, $c_0$ and $\beta_{\mathrm{scale}}$, and we use them to set

\[
\gamma\Delta_{i-1} := \sqrt{\frac{c_0\,\mathrm{err}(h_i, Z_{i-1})}{i - 1}} + \max(2\alpha, 4)\,c_0\,\frac{\log(i - 1)}{i - 1}, \qquad \beta := \frac{\sqrt{\alpha/c_0}}{\beta_{\mathrm{scale}}}.
\]

Finally, we use the following setting for the minimum query probability: 

. 1 

Mnin,* — min 


A 


cale 


(i - l)err(/tj, 1 ) + log(« - 1) 


Next we describe hyper-parameter settings for the different algorithms. A common hyper-parameter is the learning rate of the underlying online oracle, which is a reduction to importance-weighted logistic regression. For all active learning algorithms, we try the following 11 learning rates: $10^{-1} \cdot \{2^{-2}, 2^{-1}, \ldots, 2^{8}\}$. Active learning hyper-parameter settings are given in the following table:

algorithm    parameter settings                                                                        total number of settings
OAC          $(c_0, l, \beta_{\mathrm{scale}}, \alpha) \in \{0.1 \cdot \{2^{-10}, \ldots, 2^{-1}\}, 0.1, 0.3, \ldots, 0.9, 2^{0}, \ldots, 2^{4}\} \times \{3, 6, 12, 24, 48\} \times \{\sqrt{10}\} \times \{1\}$    100
IWAL0        $C_0 \in \{0.1 \cdot \{2^{-17}, 2^{-16}, \ldots, 2^{0}\}, 2^{0}, 2^{1}, \ldots, 2^{4}\}$      23
ORA-IWAL0    $C_0 \in \{2^{-17}, \ldots, 2^{5}\}$                                                       23
IWAL1        $C_0$ the same as IWAL0                                                                    23
ORA-IWAL1    $C_0$ the same as ORA-IWAL0                                                                23

Good hyper-parameters of the algorithms often lie in the interior of these value ranges.
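The grid sizes quoted in the table can be reproduced by enumerating the grids directly. The sketch below is an illustration rather than the authors' tuning script: the variable names are hypothetical, and the OAC values for $\beta_{\mathrm{scale}}$ (read here as the single value $\sqrt{10}$) and $\alpha$ (the single value 1) are an assumed reading of a partly illegible table entry, chosen to be consistent with the stated total of 100 settings.

```python
import itertools
import math

# OAC grid over c_0: 10 + 5 + 5 = 20 values.
c0_oac = ([0.1 * 2.0 ** k for k in range(-10, 0)]
          + [0.1, 0.3, 0.5, 0.7, 0.9]
          + [2.0 ** k for k in range(0, 5)])
cover_sizes = [3, 6, 12, 24, 48]
beta_scales = [math.sqrt(10.0)]  # assumption: a single beta_scale value
alphas = [1.0]                   # assumption: a single alpha value

oac_settings = list(itertools.product(c0_oac, cover_sizes, beta_scales, alphas))

# IWAL grid over C_0: 18 + 5 = 23 values (2^0 appears as both 0.1 * 2^0 and 2^0,
# which are different numbers, so there is no duplicate).
c0_iwal = [0.1 * 2.0 ** k for k in range(-17, 1)] + [2.0 ** k for k in range(0, 5)]

# ORA-IWAL grid over C_0: exponents -17..5 give 23 values.
c0_ora_iwal = [2.0 ** k for k in range(-17, 6)]

# Learning rates of the online oracle: 0.1 * 2^k for k = -2..8 gives 11 values.
learning_rates = [0.1 * 2.0 ** k for k in range(-2, 9)]

print(len(oac_settings), len(c0_iwal), len(c0_ora_iwal), len(learning_rates))
```

Each active learning setting is then combined with every learning rate, so the number of runs per dataset is the product of the two grid sizes.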
G.3 More Experimental Results

We provide detailed per-dataset results in Figures 4 to 7. Figures 4 and 5 show test error rates obtained by each algorithm using the best fixed hyper-parameter setting against number of label queries for small (fewer than $10^5$ examples) and large (more than $10^5$ examples) datasets. Figures 6 and 7 show results obtained by each algorithm using the best hyper-parameter setting for each dataset.
[Figure 4 plots not recoverable from the extraction. One panel per dataset (recoverable panel titles: a9a, adult); x-axis: number of label queries; y-axis: test error; recoverable legend entries: IWAL0, IWAL1, ORA-IWAL0, ORA-IWAL1, PASSIVE.]

Figure 4: Test error under the best fixed hyper-parameter setting vs. number of label queries for datasets with fewer than $10^5$ examples
[Figure 5 plots not recoverable from the extraction. One panel per dataset (bio, maptaskcoref, activity, skin, vehv2binary, census, covtype, rcv1); x-axis: number of label queries; y-axis: test error.]

Figure 5: Test error under the best fixed hyper-parameter setting vs. number of label queries for datasets with more than $10^5$ examples
Figure 6: Test error under the best hyper-parameter setting for each dataset vs. number of label queries for datasets with fewer than $10^5$ examples

Figure 7: Test error under the best hyper-parameter setting for each dataset vs. number of label queries for datasets with more than $10^5$ examples